Abstract

Confident prediction is highly relevant in machine learning; for example, in applications
such as medical diagnosis, a wrong prediction can be fatal. For classification, there already
exist procedures that allow one to abstain from classifying an example when the confidence in
its prediction is weak. This approach is known as classification with reject option. In the
present paper, we provide new methodology for this approach. Predicting a new instance via a
confidence set, we ensure an exact control of the probability of classification. Moreover, we
show that this methodology is easily implementable and entails attractive theoretical and
numerical properties.
1 Introduction

Binary classification aims at assigning a label $Y \in \{0,1\}$ to a given example $X \in \mathcal{X}$. The goal
is then to build a classification rule $s : \mathcal{X} \to \{0,1\}$ so that $s(X)$, the predicted label for the
observed example $X$, is as close as possible to the label $Y$. In this framework, the question of
confident prediction, that is, how accurate the prediction $s(X)$ is, becomes central. Doubts
about the predicted label $s(X)$ may arise in two situations: if the conditional probability
$\eta^*(x) = P(Y = 1 \mid X = x)$ is close to 1/2, so that the feature $x$ is hard to classify whatever the
classification rule is; or if the classification rule is inefficient. In such cases, it is worth
considering procedures that are allowed to abstain from classifying an observation when the
doubt is too large. This is called classification with reject option.
This setting is particularly relevant in applications where a wrong classification may have
serious consequences: it is then better to assign no label than to assign an unreliable one.
Procedures for classification with reject option have been studied by several authors
[Cho70, VGS99, VGS05, HW06, BW08, WY11] and references therein. In this
context, two questions arise: how to determine whether we should classify an example or not;
and how to take the reject option into account? The works on classification with reject option
can be separated according to two approaches.

i) The works which rely on the conformal predictors algorithm [VGS99, VGS05]. The general
idea of conformal prediction is to build, for a given feature $X$, a set $\Gamma(X)$ which takes its
values in $\{\emptyset, \{0\}, \{1\}, \{0,1\}\}$ and contains the true label with high probability. The feature $X$ is
not classified if $\mathrm{card}(\Gamma(X)) \neq 1$. One of the most important ideas behind the construction of
conformal predictors is the notion of conformity. More precisely, the value of $\Gamma(X)$ depends
on the similarity between the example $X$ and an already collected labeled dataset. The
procedure thus uses local arguments and can be seen as a transductive method [GKKW02, Vap98].
In terms of performance, the set $\Gamma(X)$ is built in order to control the overall misclassification
risk: for a given significance level $\varepsilon \in (0,1)$, the set $\Gamma(X)$ satisfies $P(Y \notin \Gamma(X)) \leq \varepsilon$. The
major drawback of the conformal prediction approach is that it does not take the reject option
into account in the risk. Moreover, if the significance level $\varepsilon$ is too small, the resulting set
$\Gamma(X)$ belongs to $\{\emptyset, \{0,1\}\}$ for all $X$. Hence, the use of the reject option is irrelevant.

ii) The other works rely on the setting provided in [Cho70, HW06, BW08, WY11]. In this
case, a classification rule with reject option $s_R$ takes its values in $\{0, 1, \mathrm{Re}\}$, where $s_R(X) = \mathrm{Re}$
means reject: no label is assigned to the instance $X$. In the above mentioned works, rejecting is
viewed as an error and, for a fixed value of some parameter $\alpha \in [1/2, 1]$, the cost of the rejection
is $1 - \alpha$. Therefore, the risk function associated with a classifier with reject option is given
by $L_\alpha(s_R) = P(\{s_R(X) \neq Y\} \text{ and } \{X \text{ is classified}\}) + (1 - \alpha) P(\{X \text{ is rejected}\})$. The results
provided in [Cho70] illustrate that the optimal reject procedure for $L_\alpha$ is given by

$$
s_R^*(X) =
\begin{cases}
0 & \text{if } \eta^*(X) < 1 - \alpha, \\
1 & \text{if } \eta^*(X) > \alpha, \\
\mathrm{Re} & \text{otherwise.}
\end{cases}
$$

Herbei and Wegkamp [HW06] study the asymptotic optimality of procedures based on plug-in
rules or on empirical risk minimization. We point out some limitations of this approach. First,
the choice of the parameter $\alpha$ is fundamental for the procedure and fixing it is tricky: if its
value is either too small or too large, the use of the reject option can be irrelevant. Moreover,
this approach does not allow one to control either of the two parts of the risk function, in
particular the rejection probability. Hence, a comparison of two classifiers with reject option
in terms of the risk function $L_\alpha$ remains difficult to interpret: they do not necessarily have
the same rejection probabilities.

Both approaches presented above bring the reject option into play through either a set
(conformal predictor) or a classifier with reject option. However, neither provides a control on
the probability of classifying a feature. In the present paper we consider a new way to tackle the
problem of classification with reject option. We aim at controlling the rejection probability and
at bounding the misclassification risk restricted to the set of labeled examples. Both
considerations are new. For a given classifier $s$ and a feature $X$, our methodology involves a
statistical procedure which provides a set $\Gamma_s(X) \in \{\{0\}, \{1\}, \{0,1\}\}$, namely a confidence set. We
introduce in the present work oracle confidence sets, say $\Gamma^*$, which rely on a score function
deduced from $\eta^*$ and also on its cumulative distribution function. The main characteristic of
oracle confidence sets is that they control exactly the rejection probability $P(\Gamma^*(X) = \{0,1\})$:
under a mild assumption, we get level-$\varepsilon$ confidence sets, which we call $\varepsilon$-confidence sets. This
aspect prevents an irrelevant use of the reject option. Hence we do not view the rejection
as an error but simply as a parameter that we are able to control; moreover, we evaluate the
quality of a confidence set through the misclassification risk conditional on the set of classified
examples. That is, for a given classification rule $s$, we focus on the control of the risk function
$\mathcal{R}(\Gamma_s) = P(Y \notin \Gamma_s(X) \mid \{X \text{ is classified}\})$. To the best of our knowledge, none of the earlier
works provides a control of this risk or of the rejection probability. According to the risk
function $\mathcal{R}$, for $\varepsilon \in ]0,1]$, the $\varepsilon$-confidence sets are shown to be optimal over the set of all
confidence sets with rejection probability equal to $1 - \varepsilon$. Another contribution of the paper is
to provide an algorithm which involves a consistent estimator of $\eta^*$ and yields a confidence set.
For a given level $\varepsilon \in ]0,1]$, we do not build a single algorithm for constructing asymptotically
level-$\varepsilon$ confidence sets, but a general device that takes as input a consistent estimator of the
regression function and an unlabeled sample, and produces as output a confidence set which is
provably asymptotically of level $\varepsilon$ and consistent (i.e., the excess risk tends to zero). The
resulting confidence set is referred to as the plug-in $\varepsilon$-confidence set. Furthermore, we establish
rates of convergence under the Tsybakov noise assumption on the data generating distribution.
Moreover, these confidence sets have the advantage of being easily implementable.

Footnote: In the terminology of conformal predictors, both of the outputs $\emptyset$ and $\{0,1\}$ mean that no
label is assigned. Both are needed to guarantee the exact control of the overall misclassification
risk regardless of the classification rule used to build the set $\Gamma(X)$.

The rest of the paper is organized as follows. The definition and the important properties
of the $\varepsilon$-confidence sets are provided in Section 2, where we also apply the $\varepsilon$-confidence sets to
the Gaussian mixture model. Section 3 is devoted to the introduction of the plug-in
$\varepsilon$-confidence sets and their asymptotic behavior. We present a numerical illustration of our
results in Section 4. We finally draw some conclusions and present perspectives of our work in
the concluding section. Proofs of our results are postponed to the Appendix.

Notation: First, we state general notation. Let $(X, Y)$ be the generic data-structure taking its
values in $\mathcal{X} \times \{0,1\}$ with distribution $P$. Let $(X_\bullet, Y_\bullet)$ be a random variable independent of
$(X, Y)$ and with the same law as $(X, Y)$. The goal in classification is to predict the label $Y_\bullet$,
given an observation of $X_\bullet$. This is performed based on a classifier (or classification rule) $s$,
which is a function mapping $\mathcal{X}$ onto $\{0,1\}$. Let $\mathcal{S}$ be the set of all classifiers. The
misclassification risk $R$ associated with $s \in \mathcal{S}$ is defined as

$$
R(s) = P(s(X) \neq Y).
$$

Moreover, the minimizer of $R$ over $\mathcal{S}$ is the Bayes classifier, denoted by $s^*$, and is characterized
by

$$
s^*(\cdot) = \mathbf{1}\{\eta^*(\cdot) \geq 1/2\},
$$

where $\eta^*(x) = P(Y = 1 \mid X = x)$ for $x \in \mathcal{X}$. One of the most important quantities in our
methodology is the function $f^*$ defined by $f^*(\cdot) = \max\{\eta^*(\cdot), 1 - \eta^*(\cdot)\}$. It will play the role of
a score function.

Let us now consider more specific notation related to the classification with reject option setting.
Let $s \in \mathcal{S}$ be a classifier. A confidence set $\Gamma_s$ associated with the classifier $s$ is defined as a
measurable function that maps $\mathcal{X}$ onto $\{\{0\}, \{1\}, \{0,1\}\}$, such that for an example $X_\bullet$, the set
$\Gamma_s(X_\bullet)$ can be either $\{s(X_\bullet)\}$ or $\{0,1\}$. We decide to classify the example $X_\bullet$, according to the
label $s(X_\bullet)$, if $\mathrm{card}(\Gamma_s(X_\bullet)) = 1$. In the case where $\Gamma_s(X_\bullet) = \{0,1\}$, we decide not to classify
(reject) the feature $X_\bullet$. Let $\Gamma_s$ be a confidence set. The probability of classifying a feature is
denoted by

$$
\mathcal{P}(\Gamma_s) := P(\mathrm{card}(\Gamma_s(X_\bullet)) = 1). \qquad (1)
$$

In our approach, the probability of classifying $\mathcal{P}(\Gamma_s)$ is not viewed as a success or an error but
simply as a parameter that we have to control. Hence, the definition of a confidence set makes
natural the following definition of the risk associated with $\Gamma_s$:

$$
\mathcal{R}(\Gamma_s) = P(Y_\bullet \notin \Gamma_s(X_\bullet) \mid \mathrm{card}(\Gamma_s(X_\bullet)) = 1)
= P(s(X_\bullet) \neq Y_\bullet \mid \mathrm{card}(\Gamma_s(X_\bullet)) = 1). \qquad (2)
$$

The risk $\mathcal{R}(\Gamma_s)$ is the misclassification risk of $s$ conditional on the event that $X_\bullet$ is classified.
Moreover, for some $\varepsilon \in ]0,1]$, we say that, for two confidence sets $\Gamma_s$ and $\Gamma_{s'}$ such that
$\mathcal{P}(\Gamma_s) = \mathcal{P}(\Gamma_{s'}) = \varepsilon$, the confidence set $\Gamma_s$ is "better" than $\Gamma_{s'}$ if $\mathcal{R}(\Gamma_s) \leq \mathcal{R}(\Gamma_{s'})$.
2 ε-confidence sets


In the present section, we define a class of confidence sets, referred to as $\varepsilon$-confidence sets,
which are optimal according to the definition of risk (2). We always keep in mind that the
classification probability (1) will be under control. In Section 2.1, we define and state the
important properties of the class of $\varepsilon$-confidence sets. We then apply the $\varepsilon$-confidence sets to
the Gaussian mixture case in Section 2.2. We end this section with a comparison to classifiers
with reject option in Section 2.3.


2.1 Definition and properties 

The definition of $\varepsilon$-confidence sets relies on the Bayes classifier $s^*$ and the cumulative
distribution function of $f^*(X)$.

Definition 1. Let $\varepsilon \in ]0,1]$. The $\varepsilon$-confidence set is defined as follows:

$$
\Gamma^*_\varepsilon(X_\bullet) =
\begin{cases}
\{s^*(X_\bullet)\} & \text{if } F_{f^*}(f^*(X_\bullet)) \geq 1 - \varepsilon, \\
\{0,1\} & \text{otherwise,}
\end{cases}
$$

where $F_{f^*}$ is the cumulative distribution function of $f^*(X)$ and $f^*(\cdot) = \max\{\eta^*(\cdot), 1 - \eta^*(\cdot)\}$.

According to this definition, the construction of the $\varepsilon$-confidence sets relies on two important
features. First, if a label is assigned to a new feature $X_\bullet$ by the $\varepsilon$-confidence set, it is the
one provided by the Bayes classifier $s^*$. Second, we assign a label to a new data point $X_\bullet$ if the
corresponding score $f^*(X_\bullet)$ is large enough with respect to the distribution of $f^*(X)$. This is
one of the key ideas behind conformal predictors introduced in [VGS05].
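As an illustration of Definition 1, the following sketch evaluates the oracle rule on a toy model that is not taken from the paper: we assume $X$ uniform on $[0,1]$ and $\eta^*(x) = x$, so that $f^*(X)$ is uniform on $[1/2, 1]$ and its cumulative distribution function is $2t - 1$ on that interval. The function names (`oracle_confidence_set`, `toy_eta`, `toy_cdf`, `simulate`) are hypothetical. In this toy model the proportion of classified points should be close to $\varepsilon$ and the conditional risk close to $\varepsilon/4$.

```python
import random


def oracle_confidence_set(eta_star, cdf_f_star, x, eps):
    """Return {s*(x)} if F_{f*}(f*(x)) >= 1 - eps, and {0, 1} (reject) otherwise."""
    eta = eta_star(x)
    score = max(eta, 1.0 - eta)            # f*(x)
    if cdf_f_star(score) >= 1.0 - eps:
        return {1 if eta >= 0.5 else 0}    # label given by the Bayes classifier
    return {0, 1}


# Toy model (an assumption made for this sketch only):
# X ~ U[0, 1] and eta*(x) = x, hence f*(X) ~ U[1/2, 1].
def toy_eta(x):
    return x


def toy_cdf(t):
    """Cumulative distribution function of f*(X) in the toy model."""
    return min(1.0, max(0.0, 2.0 * t - 1.0))


def simulate(eps, n_draws=200_000, seed=0):
    """Monte Carlo estimate of (proportion classified, risk given classified)."""
    rng = random.Random(seed)
    n_classified = 0
    n_errors = 0
    for _ in range(n_draws):
        x = rng.random()
        y = 1 if rng.random() < toy_eta(x) else 0
        gamma = oracle_confidence_set(toy_eta, toy_cdf, x, eps)
        if len(gamma) == 1:
            n_classified += 1
            n_errors += (y not in gamma)
    return n_classified / n_draws, n_errors / max(n_classified, 1)
```

Calling `simulate(0.3)` estimates both quantities for $\varepsilon = 0.3$; repeating the call over a grid of values of $\varepsilon$ gives a quick check that the estimated risk grows with $\varepsilon$ in this toy setting.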

The following assumption is fundamental to establish theoretical guarantees. 


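(A1) The cumulative distribution function $F_{f^*}$ of $f^*(X)$ is continuous.

One of the main motivations for introducing the $\varepsilon$-confidence set is that, if Assumption (A1)
holds, the procedure ensures an exact control of the probability (1) of assigning a label:

$$
\mathcal{P}(\Gamma^*_\varepsilon) = P\big(F_{f^*}(f^*(X_\bullet)) \geq 1 - \varepsilon\big) = \varepsilon. \qquad (3)
$$

This happens since $F_{f^*}(f^*(X_\bullet))$ is uniformly distributed on $[0,1]$ under Assumption (A1).
Moreover, under this assumption as well, one can rewrite the $\varepsilon$-confidence sets in a different
way. Indeed, for $\varepsilon \in ]0,1[$, we have

$$
F_{f^*}(f^*(X_\bullet)) \geq 1 - \varepsilon \iff f^*(X_\bullet) \geq F_{f^*}^{-1}(1 - \varepsilon),
$$

where $F_{f^*}^{-1}$ denotes the generalized inverse of the cumulative distribution function $F_{f^*}$ (see
[vdV98]). Therefore, if we set $\alpha_\varepsilon = F_{f^*}^{-1}(1 - \varepsilon)$ for $\varepsilon \in ]0,1[$ and $\alpha_1 = 1/2$, Definition 1 is
equivalent to

$$
\Gamma^*_\varepsilon(X_\bullet) =
\begin{cases}
\{s^*(X_\bullet)\} & \text{if } f^*(X_\bullet) \geq \alpha_\varepsilon, \\
\{0,1\} & \text{otherwise.}
\end{cases}
\qquad (4)
$$

Next, we provide the most important property of the $\varepsilon$-confidence sets: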


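Proposition 1. Denote by $\boldsymbol{\Gamma}_\varepsilon$ the set $\boldsymbol{\Gamma}_\varepsilon = \{\Gamma_s ;\ \mathcal{P}(\Gamma_s) = \varepsilon\}$. Let Assumption (A1) be satisfied.

1. For any $\varepsilon \in ]0,1]$, the $\varepsilon$-confidence set satisfies the following property:

$$
\mathcal{R}(\Gamma^*_\varepsilon) = \min_{\Gamma_s \in \boldsymbol{\Gamma}_\varepsilon} \mathcal{R}(\Gamma_s).
$$

2. For $\varepsilon \in ]0,1]$ and for any $\Gamma_s \in \boldsymbol{\Gamma}_\varepsilon$, the following holds:

$$
0 \leq \mathcal{R}(\Gamma_s) - \mathcal{R}(\Gamma^*_\varepsilon)
= \frac{1}{\varepsilon} \Big\{ E\big[|2\eta^*(X_\bullet) - 1| \mathbf{1}_{C}\big]
+ E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon| \mathbf{1}_{A_0 \cup B_0}\big]
+ E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon| \mathbf{1}_{A_1 \cup B_1}\big] \Big\}, \qquad (5)
$$

where $\alpha_\varepsilon = F_{f^*}^{-1}(1 - \varepsilon)$ and

$$
\begin{aligned}
A_y &= \{f^*(X_\bullet) \geq \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 2,\ s^*(X_\bullet) \neq y\}, \quad y = 0, 1, \\
B_y &= \{f^*(X_\bullet) < \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 1,\ s(X_\bullet) \neq y\}, \quad y = 0, 1, \\
C &= \{f^*(X_\bullet) \geq \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 1,\ s^*(X_\bullet) \neq s(X_\bullet)\}.
\end{aligned}
$$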


Several remarks can be made from Proposition 1. First, for $\varepsilon \in ]0,1]$, the $\varepsilon$-confidence set is
optimal in the sense that its risk is minimal over the class of all confidence sets that assign
a label with probability $\varepsilon$. Second, the excess risk of a confidence set is directly linked to the
behavior of the function $f^*$ around $\alpha_\varepsilon$. This observation will play a major role in our main
result related to rates of convergence in the next section. Third, note that if we apply (5) with
$\varepsilon = 1$, which implies $\alpha_\varepsilon = 1/2$, we obtain the classical result in classification

$$
R(s) - R(s^*) = E\big[|2\eta^*(X_\bullet) - 1| \mathbf{1}\{s(X_\bullet) \neq s^*(X_\bullet)\}\big].
$$

Let us conclude this section by stating a result that specifies the behavior of the risk associated
with the $\varepsilon$-confidence set w.r.t. the parameter $\varepsilon$:

Proposition 2. The function $\varepsilon \mapsto \mathcal{R}(\Gamma^*_\varepsilon)$ is non-decreasing on $]0,1]$.

This result shows an expected fact: the larger the rejection probability, the smaller the error.
In particular

$$
\mathcal{R}(\Gamma^*_\varepsilon) \leq R(s^*), \quad \forall \varepsilon \in ]0,1].
$$


2.2 ε-confidence sets for Gaussian mixture

In this section, we apply the $\varepsilon$-confidence set introduced in Definition 1 to the particular case of
the Gaussian mixture model. We set $\mathcal{X} = \mathbb{R}^d$ with $d \in \mathbb{N} \setminus \{0\}$. Let us assume that the conditional
distribution of $X$ given $Y$ is Gaussian and that, for simplicity, the marginal distribution of $Y$ is
Bernoulli with parameter 1/2. To fix notation, we set

$$
X \mid Y = 0 \sim \mathcal{N}(\mu_0, \Sigma) \quad \text{and} \quad X \mid Y = 1 \sim \mathcal{N}(\mu_1, \Sigma),
$$

where $\mu_0$ and $\mu_1$ are vectors in $\mathbb{R}^d$ and $\Sigma$ is the common covariance matrix. We assume that $\Sigma$ is
invertible and denote by $\|\cdot\|_{\Sigma^{-1}}$ the norm under $\Sigma^{-1}$: for any $\mu \in \mathbb{R}^d$ we have
$\|\mu\|^2_{\Sigma^{-1}} = \mu^\top \Sigma^{-1} \mu$, where $\mu^\top$ stands for the transpose of $\mu$. The following proposition
establishes the classification error of the $\varepsilon$-confidence set $\Gamma^*_\varepsilon$ in this framework.

Proposition 3. For all $\varepsilon \in ]0,1]$, we have

$$
\mathcal{R}(\Gamma^*_\varepsilon) = \frac{P\big(\Phi(Z) + \Phi(Z + \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \leq \varepsilon\big)}{\varepsilon},
$$

where $Z$ is a standard normal random variable and $\Phi$ is the standard normal cumulative
distribution function.

The proof of this proposition is postponed to the Appendix. Interestingly, in the Gaussian
mixture case, we get a closed formula for the risk of the $\varepsilon$-confidence set. Moreover, this risk
depends on $\|\mu_1 - \mu_0\|_{\Sigma^{-1}}$, as in the binary classification framework, which corresponds to the
particular case $\varepsilon = 1$, where we do not use the reject option and where we get

$$
\mathcal{R}(\Gamma^*_1) = R(s^*) = 1 - \Phi\left(\frac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\right).
$$
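The closed formula can be checked numerically. The sketch below makes a simplifying assumption that is not in the text: it works in dimension one with unit variance, $X \mid Y = 0 \sim \mathcal{N}(0, 1)$ and $X \mid Y = 1 \sim \mathcal{N}(\Delta, 1)$, so that $\Delta$ plays the role of $\|\mu_1 - \mu_0\|_{\Sigma^{-1}}$. Since $z \mapsto \Phi(z) + \Phi(z + \Delta)$ is increasing, the probability in the formula equals $\Phi(z_0)$, where $z_0$ solves $\Phi(z_0) + \Phi(z_0 + \Delta) = \varepsilon$. The helper names (`solve_z0`, `closed_form_risk`, `monte_carlo_risk`) are hypothetical.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def solve_z0(eps, delta, lo=-40.0, hi=40.0, iters=200):
    """Bisection for z0 such that PHI(z0) + PHI(z0 + delta) = eps."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if PHI(mid) + PHI(mid + delta) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def closed_form_risk(eps, delta):
    """P(PHI(Z) + PHI(Z + delta) <= eps) / eps, using monotonicity in Z."""
    return PHI(solve_z0(eps, delta)) / eps


def monte_carlo_risk(eps, delta, n_draws=400_000, seed=1):
    """Simulate the 1-D mixture (an assumption of this sketch) and apply the oracle rule.

    In this model f*(x) is increasing in |x - delta / 2|, so the oracle set
    classifies exactly when |x - delta / 2| >= c for a suitable threshold c.
    """
    rng = random.Random(seed)
    c = -solve_z0(eps, delta) - delta / 2.0
    n_classified = 0
    n_errors = 0
    for _ in range(n_draws):
        y = rng.random() < 0.5
        x = rng.gauss(delta if y else 0.0, 1.0)
        if abs(x - delta / 2.0) >= c:
            n_classified += 1
            n_errors += ((x >= delta / 2.0) != y)   # Bayes label vs. true label
    return n_classified / n_draws, n_errors / n_classified
```

For instance, `monte_carlo_risk(0.5, 2.0)` returns the simulated proportion of classified points and the simulated conditional risk, which can be compared with `closed_form_risk(0.5, 2.0)`.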


2.3 Relation with classifiers with reject option 

The problem of classification with reject option has already been introduced in [Cho70]. More
recently the terminology of classifiers with reject option has been defined in [HW06]: a classifier
with reject option is a measurable function which maps $\mathcal{X}$ onto $\{0, 1, \mathrm{Re}\}$, where the output $\mathrm{Re}$
means reject. For a parameter $\alpha \in [1/2, 1]$ and for $s_R$ a classifier with reject option, the risk
function considered in [HW06] is

$$
L_\alpha(s_R) = P(s_R(X) \neq Y,\ s_R(X) \neq \mathrm{Re}) + (1 - \alpha) P(s_R(X) = \mathrm{Re}). \qquad (6)
$$

This risk has been studied in the context of classification with reject option in the papers
[HW06, WY11] and references therein. We notice that the above risk treats rejection as a part of
the error in the same way as a wrong classification. The parameter $1 - \alpha$ controls the trade-off
between these two "errors". In other words, $1 - \alpha$ is the cost of using the reject option. This
is a major difference with our point of view. Indeed, we recall that the probability of rejection
is a parameter in our setting, and we therefore do not include it in the risk (2). As a
consequence, if we bound the risk (2) while keeping under control the probability of
classifying (1), we are able to bound the risk (6). The reverse is not true: controlling (6) does
not provide any control on the probability of rejection, and one cannot avoid an irrelevant use
of the reject option in some situations. This difference can be significant in practical
situations where the knowledge of $\mathcal{P}(\Gamma_s)$ is relevant information. Indeed, when dealing with
several examples to label, controlling this probability fixes the amount of data we wish to
label. Hence our methodology prevents an irrelevant use of the reject option. For the same
reason, a second important feature that differs between both methodologies is that the
comparison between two confidence sets is easier than the comparison between two classifiers
with reject option. Indeed, for some $\varepsilon \in ]0,1]$, we say that, for two confidence sets $\Gamma_s$ and
$\Gamma_{s'}$ such that $\mathcal{P}(\Gamma_s) = \mathcal{P}(\Gamma_{s'}) = \varepsilon$, the confidence set $\Gamma_s$ is "better" than $\Gamma_{s'}$ if
$\mathcal{R}(\Gamma_s) \leq \mathcal{R}(\Gamma_{s'})$. As the study of the risk function $L_\alpha$ given in Equation (6) does not
provide any control on the probability of classifying, it is much more difficult to compare the
performance of two classifiers with reject option on the set of labeled data. This point will be
made clear in Section 4 with the numerical experiment.

Let us now consider optimality: the paper [HW06] also provides the optimal rule for
the risk (6). For each $\alpha \in [1/2, 1]$, the Bayes rule with reject option $s_R^*$ is defined such that

$$
L_\alpha(s_R^*) = \min_{s_R} L_\alpha(s_R),
$$

where the minimum is taken over all classifiers with reject option. It is characterized by

$$
s_R^*(X_\bullet) =
\begin{cases}
s^*(X_\bullet) & \text{if } f^*(X_\bullet) > \alpha, \\
\mathrm{Re} & \text{otherwise,}
\end{cases}
\qquad (7)
$$

where $f^*(\cdot) = \max\{\eta^*(\cdot), 1 - \eta^*(\cdot)\}$ as in our setting. Obviously, the Bayes rule with reject
option $s_R^*$ can be written in terms of confidence sets. This leads to a confidence set defined in
the same way as in Equation (4). However, there is an important difference: the main
contribution of the present paper is to provide a methodology to pick the threshold in (4) such
that the probability of classifying an example (1) is exactly $\varepsilon$. The key to be able to do so is
the use of the cumulative distribution function of $f^*(X)$. In Section 3.1 we will see that the
data-driven counterpart of the $\varepsilon$-confidence set defined in Definition 1 also controls the
probability of classifying. Notably, this is possible in a semi-supervised way, that is, only
using a set of unlabeled data.


3 Plug-in ε-confidence sets

This section is devoted to the study of the data-driven counterpart of the $\varepsilon$-confidence sets
provided by the plug-in rule. We provide the construction of the plug-in methods in Section 3.1.
Their asymptotic consistency as well as rates of convergence are given in Section 3.2.


3.1 Definition of the plug-in ε-confidence sets

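For $\varepsilon \in ]0,1]$, the construction of our plug-in $\varepsilon$-confidence set relies on a preliminary
estimator of the regression function $\eta^*$. To this end, we introduce a first dataset $\mathcal{D}_n$, which
consists of $n$ independent copies of $(X, Y)$. The dataset $\mathcal{D}_n$ is used to estimate the function $\eta^*$
and therefore the functions $f^*$ and $s^*$ as well. Let us denote by $\hat\eta$, $\hat f(\cdot) = \max(\hat\eta(\cdot), 1 - \hat\eta(\cdot))$ and
$\hat s(\cdot) = \mathbf{1}\{\hat\eta(\cdot) \geq 1/2\}$ the estimators of $\eta^*$, $f^*$ and $s^*$ respectively. Thanks to these estimators, a
data-driven approximation of the $\varepsilon$-confidence set given in Definition 1 is

$$
\tilde\Gamma_\varepsilon(X_\bullet) =
\begin{cases}
\{\hat s(X_\bullet)\} & \text{if } F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon, \\
\{0,1\} & \text{otherwise,}
\end{cases}
$$

where $F_{\hat f}$ is the cumulative distribution function of $\hat f(X)$. Hence, $\tilde\Gamma_\varepsilon(X_\bullet)$ invokes the
cumulative distribution function $F_{\hat f}$, which is unknown and therefore needs to be estimated. We
then consider a second dataset, independent of $\mathcal{D}_n$, denoted by $\mathcal{D}_N = \{X_i,\ i = 1, \ldots, N\}$, where
$X_1, \ldots, X_N$ are independent copies of $X$. Based on $\mathcal{D}_N$, we estimate the cumulative distribution
function $F_{\hat f}$ by the empirical cumulative distribution function of $\hat f(X)$, denoted by $\hat F_{\hat f}$.
Now, we can define the plug-in $\varepsilon$-confidence set: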


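Definition 2. Let $\varepsilon \in ]0,1]$ and $\hat\eta$ be any estimator of $\eta^*$. The plug-in $\varepsilon$-confidence set is
defined as follows:

$$
\hat\Gamma_\varepsilon(X_\bullet) =
\begin{cases}
\{\hat s(X_\bullet)\} & \text{if } \hat F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon, \\
\{0,1\} & \text{otherwise,}
\end{cases}
$$

where $\hat f(\cdot) = \max\{\hat\eta(\cdot), 1 - \hat\eta(\cdot)\}$ and $\hat F_{\hat f}(\hat f(X_\bullet)) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat f(X_i) \leq \hat f(X_\bullet)\}$.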

Remark 1. The samples $\mathcal{D}_n$ and $\mathcal{D}_N$ play completely different roles. The sample $\mathcal{D}_n$ is used
to estimate $\eta^*$ and must therefore consist of labeled observations. The second dataset $\mathcal{D}_N$,
used to estimate the cumulative distribution function $F_{\hat f}$, requires only a set of unlabeled
observations. Hence, the construction of the plug-in $\varepsilon$-confidence sets does not require more
labeled examples than in the classical classification setting. This is particularly interesting
in practical situations where the number of labeled examples is small while a large number of
unlabeled observations is available.
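The construction of Definition 2 can be sketched in a few lines. The class below is a minimal illustration, not the authors' implementation: `eta_hat` stands for any fitted estimator of $\eta^*$ exposed as a Python callable (an assumption), `unlabeled_sample` plays the role of $\mathcal{D}_N$, and the class and method names are hypothetical. The empirical cumulative distribution function is evaluated with a binary search on the sorted scores.

```python
from bisect import bisect_right


class PlugInConfidenceSet:
    """Plug-in epsilon-confidence set built from an estimator and unlabeled data."""

    def __init__(self, eta_hat, unlabeled_sample, eps):
        self.eta_hat = eta_hat      # callable x -> estimate of P(Y = 1 | X = x)
        self.eps = eps
        # Scores f_hat(X_i) on the unlabeled sample; no labels are needed here.
        self.sorted_scores = sorted(self._score(x) for x in unlabeled_sample)

    def _score(self, x):
        p = self.eta_hat(x)
        return max(p, 1.0 - p)      # f_hat(x) = max(eta_hat(x), 1 - eta_hat(x))

    def empirical_cdf(self, t):
        """(1 / N) * #{i : f_hat(X_i) <= t}."""
        return bisect_right(self.sorted_scores, t) / len(self.sorted_scores)

    def predict(self, x):
        """Return {s_hat(x)} when the score is high enough, {0, 1} otherwise."""
        p = self.eta_hat(x)
        if self.empirical_cdf(max(p, 1.0 - p)) >= 1.0 - self.eps:
            return {1 if p >= 0.5 else 0}
        return {0, 1}
```

A typical use would be `PlugInConfidenceSet(eta_hat, unlabeled_sample, eps=0.8).predict(x_new)`; a returned set of cardinality 2 means that the new point is rejected.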
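3.2 Theoretical performance

This section is devoted to assessing the asymptotic performance of the plug-in $\varepsilon$-confidence set.
The symbols $\mathbf{P}$ and $\mathbf{E}$ stand for generic probability and expectation, respectively. Let
$\varepsilon \in ]0,1]$, and $\hat\Gamma_\varepsilon$ be a plug-in $\varepsilon$-confidence set. We define the risk of $\hat\Gamma_\varepsilon$ by the natural
quantity

$$
\mathbf{R}(\hat\Gamma_\varepsilon) = \mathbf{P}\big(\hat s(X_\bullet) \neq Y_\bullet \mid \hat F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon\big). \qquad (9)
$$

Note that we use here the notation $\mathbf{R}$ rather than the previous one $\mathcal{R}$ given by (2) to stress that
the probability $\mathbf{P}$ is taken under the joint law of the samples $\mathcal{D}_n$, $\mathcal{D}_N$ and $(X_\bullet, Y_\bullet)$ instead of
just $(X_\bullet, Y_\bullet)$. Throughout this section we assume the following condition on the cumulative
distribution function $F_{\hat f}$, which is analogous to Assumption (A1). However, this assumption
relies on the estimator $\hat f$ and is therefore not restrictive, since the estimator can be chosen by
the statistician.

(A2) The cumulative distribution function $F_{\hat f}$ of $\hat f(X)$ is continuous.

We also define the risk of the oracle counterpart $\tilde\Gamma_\varepsilon$ of $\hat\Gamma_\varepsilon$:

$$
\mathbf{R}(\tilde\Gamma_\varepsilon) = \mathbf{P}\big(\hat s(X_\bullet) \neq Y_\bullet \mid F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon\big).
$$

The objective of this section is to prove both that

$$
\mathbf{P}\big(\hat F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon\big) \to \varepsilon, \quad \text{and} \qquad (10)
$$

$$
\mathbf{R}(\hat\Gamma_\varepsilon) \to \mathcal{R}(\Gamma^*_\varepsilon), \quad n, N \to +\infty,
$$

and to derive rates for these convergences. Since $\mathcal{D}_n$ is dedicated to the estimation of $\eta^*$ and
$\mathcal{D}_N$ to the estimation of $F_{\hat f}$, we prove that

$$
\mathbf{R}(\hat\Gamma_\varepsilon) - \mathbf{R}(\tilde\Gamma_\varepsilon) \to 0, \quad \text{and} \qquad (11)
$$

$$
\mathbf{R}(\tilde\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) \to 0, \qquad (12)
$$

when both $n$ and $N$ go to infinity. The convergences (10) and (11) rely on the
Dvoretzky-Kiefer-Wolfowitz inequality [Mas90], while the convergence (12) relies on the following
inequality.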
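For intuition about the role of the Dvoretzky-Kiefer-Wolfowitz inequality, recall that with the constant of [Mas90] it states that $\mathbf{P}(\sup_t |\hat F_N(t) - F(t)| > t_0) \leq 2 \exp(-2 N t_0^2)$, so the empirical cumulative distribution function is uniformly within $\sqrt{\log(2/\delta) / (2N)}$ of the truth with probability at least $1 - \delta$. The sketch below, with hypothetical function names, computes this half-width and checks it by simulation on uniform samples (the uniform distribution is chosen here only for convenience).

```python
import math
import random


def dkw_half_width(n, delta):
    """Half-width t0 with P(sup |F_n - F| > t0) <= delta (DKW, Massart's constant)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def sup_deviation_uniform(n, rng):
    """sup_t |F_n(t) - t| for n i.i.d. draws from U[0, 1]."""
    xs = sorted(rng.random() for _ in range(n))
    # The supremum is attained just before or at one of the order statistics.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def coverage(n, delta, n_rep=2000, seed=2):
    """Fraction of repetitions in which the DKW band contains the true function."""
    rng = random.Random(seed)
    t0 = dkw_half_width(n, delta)
    hits = sum(sup_deviation_uniform(n, rng) <= t0 for _ in range(n_rep))
    return hits / n_rep
```

Multiplying the sample size by four halves the band, which is the $N^{-1/2}$ behavior that appears in the rates of this section.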


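Proposition 4. For all $\varepsilon \in ]0,1]$, the following inequality holds under Assumptions (A1) and
(A2):

$$
\begin{aligned}
0 \leq \mathbf{R}(\tilde\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) \leq \frac{1}{\varepsilon} \Big\{
& \mathbf{E}\big[|\eta^*(X_\bullet) - \alpha_\varepsilon| \mathbf{1}\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \geq |\eta^*(X_\bullet) - \alpha_\varepsilon|\}\big] \\
& + \mathbf{E}\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon| \mathbf{1}\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \geq |1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\}\big] \\
& + \alpha_\varepsilon |F_{\hat f}(\alpha_\varepsilon) - F_{f^*}(\alpha_\varepsilon)| \Big\},
\end{aligned}
$$

where $\alpha_\varepsilon = F_{f^*}^{-1}(1 - \varepsilon)$.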

The proof of Proposition 4 relies on Proposition 1. For $\varepsilon \in ]0,1]$, Proposition 4 evaluates
the loss of performance incurred by using the confidence set $\tilde\Gamma_\varepsilon$ instead of the $\varepsilon$-confidence
set $\Gamma^*_\varepsilon$. We can distinguish two parts in the upper bound. One part is linked to the
classification with reject option setting provided by [HW06] and depends on the behavior of the
function $f^*$ around $\alpha_\varepsilon$. Note that the same quantity is obtained by [HW06]. The second part,
$\alpha_\varepsilon |F_{\hat f}(\alpha_\varepsilon) - F_{f^*}(\alpha_\varepsilon)|$, is related to our proposed confidence set and is due to the
approximation of $F_{f^*}$ by $F_{\hat f}$.

Observe that when $\varepsilon = 1$ one can recover a classical inequality in the classification setting.
Indeed, in this case, $\alpha_\varepsilon = 1/2$ and $F_{\hat f}(1/2) = F_{f^*}(1/2)$. Hence, we obtain

$$
\mathbf{R}(\tilde\Gamma_1) - \mathcal{R}(\Gamma^*_1) \leq \mathbf{E}\big[|2\eta^*(X_\bullet) - 1| \mathbf{1}\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \geq |\eta^*(X_\bullet) - 1/2|\}\big].
$$

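Finally, we state our main result, which describes the asymptotic behavior of our plug-in
$\varepsilon$-confidence sets:

Theorem 1. 1. If $\hat\eta(X_\bullet) \to \eta^*(X_\bullet)$ in probability when $n \to +\infty$, then for any $\varepsilon \in ]0,1]$

$$
\mathbf{P}\big(\hat F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon\big) \to \varepsilon,
$$

and

$$
\mathbf{R}(\hat\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) \to 0,
$$

when both $n$ and $N$ go to infinity.

2. For any $\varepsilon \in ]0,1]$, assume that there exist $C_1 < \infty$ and $\gamma_\varepsilon > 0$ such that

$$
P(|f^*(X) - \alpha_\varepsilon| \leq t) \leq C_1 t^{\gamma_\varepsilon}, \quad \forall t > 0. \qquad (13)
$$

Assume also that there exist a sequence of positive numbers $a_n \to +\infty$ and some positive
constants $C_2$, $C_3$ such that

$$
\mathbf{P}(|\hat\eta(x) - \eta^*(x)| \geq t) \leq C_2 \exp(-C_3 a_n t^2), \quad \forall t > 0,\ \forall x \in \mathcal{X}. \qquad (14)
$$

Then we have

$$
\mathbf{P}\big(\hat F_{\hat f}(\hat f(X_\bullet)) \geq 1 - \varepsilon\big) = \varepsilon + O(N^{-1/2}), \qquad (15)
$$

and

$$
\mathbf{R}(\hat\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) = O(a_n^{-\gamma_\varepsilon/2}) + O(N^{-1/2}). \qquad (16)
$$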


The proof of this result is postponed to the Appendix. Theorem 1 states that if the estimator
of $\eta^*$ is consistent, then asymptotically the plug-in $\varepsilon$-confidence set is a level-$\varepsilon$ confidence set
and performs as well as the $\varepsilon$-confidence set. Moreover, several observations can be made
regarding the second point of the theorem. First of all, the rate of convergence (15) does not
require either of the assumptions (13) and (14). It only needs the consistency of the estimator
of $\eta^*$. Second, the assumption (13) has already been introduced in the classification with
reject option setting in [HW06]. It is analogous to Tsybakov's margin condition [Tsy04, AT07]
introduced in the classification framework. We point out that if $\eta^*(X_\bullet)$ has a density w.r.t.
the Lebesgue measure, the assumption (13) is satisfied with $\gamma_\varepsilon \geq 1$ for any $\varepsilon \in ]0,1]$. Third,
larger values of $\gamma_\varepsilon$ yield faster rates of convergence. However, this rate cannot be better
than $N^{-1/2}$, which is the term due to the estimation of the cumulative distribution function
$F_{\hat f}$. This term is however not limiting. Indeed, recalling that the sample size $N$ refers to
the dataset $\mathcal{D}_N$, which can consist only of unlabeled observations, getting a large $N$ is not a
big issue. Hence, we can consider the first term $a_n^{-\gamma_\varepsilon/2}$ as the leading term in (16). This
term relies on Proposition 4 and on the assumption (14), which is crucial to establish our rate
of convergence. Note that various estimators satisfy this condition, such as kernel estimators
(see [AT07] for more details).
4 Numerical results


In this section, we evaluate the plug-in $\varepsilon$-confidence sets numerically. Moreover, we indicate the
importance of Assumptions (A1) and (A2).

4.1 Under Assumptions (A1)-(A2)

In this section both of the cumulative distribution functions $F_{f^*}$ and $F_{\hat f}$ are continuous. We
generate $(X, Y)$ according to the following models.

• Model 1:

1. the feature $X = (U_1, \ldots, U_{10})$, where the $U_i$ are i.i.d. from a uniform distribution on $[0,1]$;

2. conditional on X, the label Y is drawn according to a Bernoulli distribution with 
parameter r]*{X) defined by logit(? 7 *(X)) = X^ — X^ — X^ + X®, where X^ is the j**' 
component of X. 

• Model 2: 

1. the feature $X = (N_1, N_2, N_3)$, where the $N_j$ are i.i.d. from a standard Gaussian distribution;

2. conditional on X, the label Y is drawn according to a Bernoulli distribution with 
parameter •q*{X) defined by logit(r 7 *(X)) = (X^)^ + ^ + sin(X^ -|- X^) -|- 3X^. 

The first model leads to a classification problem which is quite difficult. Indeed, using a
large dataset of features, we evaluate the distribution function of $\eta^*(X)$ and obtain that
$P(\eta^*(X) \in [0.4, 0.6]) \approx 0.5$. On the other hand, estimating $\eta^*$ is easy since $\mathrm{logit}(\eta^*(X))$ is a
linear function of $X$. Model 2 provides a simpler classification problem: the estimation of the
distribution function of $\eta^*(X)$ leads to $P(\eta^*(X) \in [0.4, 0.6]) \approx 0.15$. However, the
estimation of $\eta^*$ is a little more tricky.

In order to illustrate our convergence result, we first provide an estimation of the risk $\mathcal{R}$ for
the $\varepsilon$-confidence sets. More precisely, for each model and each $\varepsilon \in \{k/10,\ k \in \{1, \ldots, 10\}\}$, we
repeat $B = 100$ times the following steps:

i) simulate two datasets $\mathcal{D}_N$ and $\mathcal{D}_K$ according to the considered model with $N = 1000$ and
$K = 1000$;

ii) based on $\mathcal{D}_N$, we compute the empirical cumulative distribution function of $f^*(X)$ (this step
requires only the features);

iii) finally, we compute, over $\mathcal{D}_K$, the empirical counterpart $\mathcal{R}_K$ of the risk $\mathcal{R}$ of the
$\varepsilon$-confidence set, using the empirical cumulative distribution function of $f^*(X)$ instead of
$F_{f^*}$. We also compute the proportion of classified instances $\mathcal{P}_K$.
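Steps i) to iii) can be sketched as follows. The sketch mimics the structure of Model 1 (ten independent uniform features and a logit that is a signed sum of four components), but the component indices used in `eta_star` are an arbitrary assumption, so the numbers it produces are not those of the paper. All function names are hypothetical, and the default number of repetitions is kept small to limit the running time.

```python
import math
import random
import statistics
from bisect import bisect_right


def eta_star(x):
    """Regression function with a linear logit (the indices are an assumption)."""
    z = x[0] - x[1] - x[2] + x[3]
    return 1.0 / (1.0 + math.exp(-z))


def draw(rng, size, dim=10):
    """Draw `size` pairs (x, y): uniform features, Bernoulli(eta_star(x)) label."""
    data = []
    for _ in range(size):
        x = [rng.random() for _ in range(dim)]
        y = 1 if rng.random() < eta_star(x) else 0
        data.append((x, y))
    return data


def one_repetition(rng, eps, n_unlabeled=1000, n_test=1000):
    # Step i): two independent datasets.
    d_n = draw(rng, n_unlabeled)
    d_k = draw(rng, n_test)
    # Step ii): empirical distribution of f*(X), using the features only.
    scores = sorted(max(eta_star(x), 1.0 - eta_star(x)) for x, _ in d_n)
    # Step iii): empirical risk and proportion of classified points on d_k.
    classified = 0
    errors = 0
    for x, y in d_k:
        p = eta_star(x)
        if bisect_right(scores, max(p, 1.0 - p)) / len(scores) >= 1.0 - eps:
            classified += 1
            errors += ((p >= 0.5) != (y == 1))
    return errors / max(classified, 1), classified / n_test


def experiment(eps, n_rep=20, seed=3):
    """Mean and standard deviation of the risk and of the classified proportion."""
    rng = random.Random(seed)
    risks, props = zip(*(one_repetition(rng, eps) for _ in range(n_rep)))
    return (statistics.mean(risks), statistics.stdev(risks),
            statistics.mean(props), statistics.stdev(props))
```

Calling `experiment(0.5)` returns the four summary statistics for $\varepsilon = 0.5$; the mean proportion of classified points should be close to 0.5 whatever the regression function.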

From these experiments, we compute the mean and standard deviation of $\mathcal{R}_K$ and $\mathcal{P}_K$. The
results are reported in Table 1 and illustrated in Figure 1. Next, for each model and each
$\varepsilon \in \{k/10,\ k \in \{1, \ldots, 10\}\}$, we estimate the risk $\mathbf{R}$ for the plug-in $\varepsilon$-confidence set. We
propose to use three popular classification methods for the estimation of $\eta^*$: random forest,
logistic regression, and a kernel rule based on the Gaussian kernel with window parameter equal
to 1. We perform the following simulation scheme. We repeat independently $B$ times the following
steps:

i) simulate three datasets $\mathcal{D}_n$, $\mathcal{D}_N$, $\mathcal{D}_K$ according to the considered model;

ii) based on $\mathcal{D}_n$, we compute an estimate, denoted by $\hat f$, of $f^*$ with the random forest, the
logistic regression or the kernel rule procedure;

iii) based on $\mathcal{D}_N$, we compute the empirical cumulative distribution function of $\hat f(X)$ (we
recall that this step requires a dataset which contains only the features);

iv) finally, over $\mathcal{D}_K$, we compute the empirical counterpart $\mathbf{R}_K$ of the risk and the proportion
$\mathcal{P}_K$ of the data which are not rejected.

             Model 1                      Model 2
  ε     R_K          P_K            R_K          P_K
  1     0.39 (0.01)  1.00 (0.00)    0.22 (0.01)  1.00 (0.00)
  0.9   0.38 (0.02)  0.90 (0.01)    0.19 (0.01)  0.90 (0.01)
  0.8   0.37 (0.02)  0.80 (0.02)    0.16 (0.01)  0.80 (0.02)
  0.7   0.35 (0.02)  0.70 (0.02)    0.14 (0.01)  0.70 (0.02)
  0.6   0.34 (0.02)  0.60 (0.02)    0.12 (0.01)  0.60 (0.02)
  0.5   0.33 (0.02)  0.50 (0.02)    0.09 (0.01)  0.50 (0.02)
  0.4   0.31 (0.02)  0.40 (0.02)    0.07 (0.01)  0.40 (0.02)
  0.3   0.29 (0.03)  0.30 (0.02)    0.05 (0.01)  0.30 (0.02)
  0.2   0.27 (0.03)  0.20 (0.02)    0.03 (0.01)  0.20 (0.02)
  0.1   0.24 (0.03)  0.10 (0.01)    0.02 (0.01)  0.10 (0.01)

Table 1: For each of the B = 100 repetitions and each model, we derive the estimates R_K of
the risk and the estimated proportions of classified instances P_K of the ε-confidence sets w.r.t.
ε. We report the means and standard deviations (in parentheses) over the B = 100
repetitions. Left: the data are generated according to Model 1. Right: the data are generated
according to Model 2.

From these results, we compute the means and standard deviations of both empirical risks
and proportions of classified instances for $n \in \{100, 1000\}$. We fix $N = 100$ and $K = 1000$. The
results are illustrated in Figure 1 and provided in Table 2 and the table following it.

From our numerical study, we make several observations. First, as expected, the risk of the
$\varepsilon$-confidence sets decreases as $\varepsilon$ decreases, as observed in Table 1. In both models, the reject
option improves the misclassification risk on the classified examples. As an example, in Model 2
the estimated misclassification risk without rejection, that is when $\varepsilon = 1$, equals 0.22, whereas
for $\varepsilon = 0.1$ the estimated risk is 0.02, which is a significant improvement. Note that in Model 1
the classification problem is quite difficult, and the decrease of the risk is slower and less
pronounced. On the other hand, we also observe in Table 1 that the proportions of classified
data match the theoretical values. The same comments apply to the plug-in results of Table 2
and the table following it, in both models and whatever the classification procedure used.
Moreover, some features of these results are worth commenting on. For fixed $\varepsilon$ and for each
scenario, the estimated risk of all the plug-in $\varepsilon$-confidence sets decreases with $n$, the size of
the sample used to estimate the regression function $\eta^*$. Furthermore, for $n = 1000$, comparing
with the oracle values reported in Table 1, we observe that the estimated risks of the plug-in
$\varepsilon$-confidence sets are close to the oracle ones. This illustrates the convergence result provided
in Theorem 1. However, we can see that the random forest procedure is outperformed by the other
procedures (especially when $\varepsilon$ is small). Indeed, the construction of plug-in $\varepsilon$-confidence sets
relies on the estimator of $\eta^*$: better estimators lead to better confidence sets. Figure 1
summarizes many aspects of our previous discussion.
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Model 1 


       n = 100                                      n = 1000
ε      rforest       logistic reg  kernel           rforest       logistic reg  kernel
1      0.45 (0.02)   0.43 (0.03)   0.47 (0.03)      0.42 (0.02)   0.39 (0.02)   0.42 (0.03)
0.9    0.45 (0.02)   0.42 (0.03)   0.45 (0.03)      0.41 (0.02)   0.38 (0.02)   0.41 (0.03)
0.8    0.44 (0.02)   0.42 (0.03)   0.45 (0.03)      0.40 (0.02)   0.37 (0.02)   0.39 (0.03)
0.7    0.44 (0.03)   0.41 (0.03)   0.44 (0.03)      0.39 (0.02)   0.36 (0.02)   0.38 (0.03)
0.6    0.43 (0.03)   0.40 (0.03)   0.43 (0.03)      0.38 (0.02)   0.35 (0.02)   0.37 (0.03)
0.5    0.42 (0.03)   0.39 (0.03)   0.42 (0.03)      0.37 (0.02)   0.34 (0.02)   0.36 (0.03)
0.4    0.41 (0.03)   0.38 (0.04)   0.40 (0.04)      0.36 (0.03)   0.32 (0.03)   0.34 (0.02)
0.3    0.41 (0.04)   0.37 (0.04)   0.39 (0.04)      0.35 (0.03)   0.30 (0.03)   0.33 (0.04)
0.2    0.40 (0.04)   0.35 (0.05)   0.37 (0.05)      0.34 (0.03)   0.28 (0.04)   0.30 (0.04)
0.1    0.38 (0.06)   0.33 (0.06)   0.35 (0.06)      0.32 (0.05)   0.25 (0.05)   0.27 (0.05)
Model 2

       n = 100                                      n = 1000
ε      rforest       logistic reg  kernel           rforest       logistic reg  kernel
1      0.26 (0.02)   0.24 (0.01)   0.27 (0.05)      0.24 (0.01)   0.22 (0.01)   0.23 (0.02)
0.9    0.24 (0.02)   0.21 (0.02)   0.25 (0.05)      0.22 (0.01)   0.20 (0.01)   0.20 (0.01)
0.8    0.21 (0.02)   0.18 (0.02)   0.22 (0.04)      0.19 (0.02)   0.17 (0.02)   0.18 (0.02)
0.7    0.19 (0.02)   0.16 (0.02)   0.19 (0.04)      0.16 (0.02)   0.14 (0.02)   0.15 (0.02)
0.6    0.18 (0.02)   0.13 (0.02)   0.16 (0.04)      0.15 (0.02)   0.12 (0.02)   0.13 (0.02)
0.5    0.16 (0.03)   0.11 (0.02)   0.14 (0.04)      0.12 (0.02)   0.10 (0.02)   0.11 (0.02)
0.4    0.15 (0.03)   0.09 (0.02)   0.11 (0.03)      0.11 (0.02)   0.08 (0.02)   0.08 (0.02)
0.3    0.13 (0.03)   0.07 (0.02)   0.08 (0.03)      0.09 (0.02)   0.06 (0.02)   0.06 (0.02)
0.2    0.12 (0.03)   0.05 (0.02)   0.06 (0.02)      0.08 (0.02)   0.04 (0.01)   0.04 (0.02)
0.1    0.10 (0.04)   0.03 (0.02)   0.04 (0.02)      0.06 (0.03)   0.02 (0.01)   0.02 (0.02)


Table 2: For each of the B = 100 repetitions and each model, we derive the estimated risks $\mathcal{R}_K$
of three different plug-in ε-confidence sets w.r.t. ε and to the sample size n. We compute the
means and standard deviations (between parentheses) over the B = 100 repetitions. For each
ε and each n, the plug-in ε-confidence sets are based on, from left to right, rforest, logistic
reg and kernel, which are respectively the random forest, the logistic regression and the kernel
rule procedures. Top: the data are generated according to Model 1 - Bottom: the data are
generated according to Model 2.
[Figure 1, four panels: Model 1 (n = 100), Model 1 (n = 1000), Model 2 (n = 100), Model 2 (n = 1000)]

Figure 1: Visual description of the results reported in Tables 1 and 2. For each model and each
n, we plot, as a function of 1 - ε, the mean over the B = 100 repetitions of the estimated
risks $\mathcal{R}_K$ of the ε-confidence sets (solid line) and of the plug-in ε-confidence sets based on
random forest (dashed line), logistic regression (dotted line) and kernel rule (dotted dashed line).
Top: the data are generated according to Model 1 (left: n = 100; right: n = 1000) - Bottom:
the data are generated according to Model 2 (left: n = 100; right: n = 1000).
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Model 1 




       n = 100                                      n = 1000
ε      rforest       logistic reg  kernel           rforest       logistic reg  kernel
1      1.00 (0.00)   1.00 (0.00)   1.00 (0.00)      1.00 (0.00)   1.00 (0.00)   1.00 (0.00)
0.9    0.90 (0.03)   0.90 (0.03)   0.90 (0.04)      0.91 (0.03)   0.90 (0.03)   0.90 (0.03)
0.8    0.80 (0.04)   0.79 (0.04)   0.80 (0.04)      0.81 (0.04)   0.80 (0.04)   0.80 (0.04)
0.7    0.70 (0.05)   0.69 (0.04)   0.70 (0.04)      0.69 (0.05)   0.69 (0.05)   0.69 (0.04)
0.6    0.61 (0.05)   0.60 (0.05)   0.61 (0.05)      0.60 (0.05)   0.60 (0.05)   0.60 (0.06)
0.5    0.51 (0.05)   0.49 (0.06)   0.50 (0.06)      0.51 (0.05)   0.50 (0.05)   0.51 (0.06)
0.4    0.40 (0.05)   0.40 (0.05)   0.40 (0.05)      0.40 (0.05)   0.40 (0.05)   0.39 (0.05)
0.3    0.30 (0.05)   0.30 (0.05)   0.30 (0.04)      0.30 (0.05)   0.30 (0.05)   0.29 (0.05)
0.2    0.20 (0.04)   0.21 (0.05)   0.21 (0.05)      0.21 (0.04)   0.21 (0.04)   0.20 (0.04)
0.1    0.10 (0.03)   0.10 (0.03)   0.11 (0.03)      0.11 (0.03)   0.10 (0.03)   0.11 (0.03)


Model 2

       n = 100                                      n = 1000
ε      rforest       logistic reg  kernel           rforest       logistic reg  kernel
1      1.00 (0.00)   1.00 (0.00)   1.00 (0.00)      1.00 (0.00)   1.00 (0.00)   1.00 (0.00)
0.9    0.90 (0.04)   0.90 (0.03)   0.90 (0.04)      0.90 (0.03)   0.90 (0.03)   0.90 (0.03)
0.8    0.81 (0.04)   0.81 (0.03)   0.81 (0.04)      0.80 (0.04)   0.80 (0.04)   0.80 (0.05)
0.7    0.70 (0.05)   0.70 (0.04)   0.70 (0.04)      0.69 (0.05)   0.69 (0.05)   0.69 (0.05)
0.6    0.61 (0.05)   0.61 (0.05)   0.60 (0.05)      0.61 (0.05)   0.60 (0.05)   0.60 (0.05)
0.5    0.51 (0.05)   0.50 (0.04)   0.50 (0.04)      0.50 (0.05)   0.51 (0.05)   0.51 (0.05)
0.4    0.39 (0.05)   0.40 (0.05)   0.39 (0.05)      0.40 (0.05)   0.40 (0.05)   0.40 (0.05)
0.3    0.29 (0.05)   0.30 (0.04)   0.30 (0.05)      0.30 (0.04)   0.29 (0.05)   0.29 (0.04)
0.2    0.21 (0.04)   0.20 (0.04)   0.20 (0.04)      0.20 (0.04)   0.21 (0.04)   0.21 (0.04)
0.1    0.11 (0.03)   0.11 (0.03)   0.11 (0.03)      0.12 (0.03)   0.11 (0.03)   0.11 (0.03)


Table 3: For each of the B = 100 repetitions and each model, we derive the estimated proportion
of classified instances $\mathcal{P}_K$ of three different plug-in ε-confidence sets w.r.t. ε and to the sample
size n. We compute the means and standard deviations (between parentheses) over the B = 100
repetitions. For each ε and each n, the plug-in ε-confidence sets are based on, from left to
right, rforest, logistic reg and kernel, which are respectively the random forest, the logistic
regression and the kernel rule procedures. Top: the data are generated according to Model 1 -
Bottom: the data are generated according to Model 2.
       (A2) fails (CART)               (A1) fails (kernel)
ε      $\mathcal{P}_K$           $\mathcal{R}_K$              $\mathcal{P}_K$           $\mathcal{R}_K$
1      1.00 (0.00)   0.27 (0.03)       1.00 (0.00)   0.32 (0.03)
0.9    0.98 (0.04)   0.27 (0.04)       0.90 (0.03)   0.31 (0.03)
0.8    0.90 (0.07)   0.24 (0.03)       0.80 (0.04)   0.29 (0.03)
0.7    0.84 (0.10)   0.22 (0.03)       0.70 (0.05)   0.27 (0.04)
0.6    0.79 (0.13)   0.21 (0.04)       0.61 (0.05)   0.26 (0.04)
0.5    0.75 (0.16)   0.21 (0.04)       0.50 (0.05)   0.24 (0.04)
0.4    0.60 (0.14)   0.18 (0.05)       0.40 (0.05)   0.23 (0.03)
0.3    0.48 (0.13)   0.18 (0.06)       0.30 (0.04)   0.22 (0.03)
0.2    0.39 (0.13)   0.18 (0.06)       0.20 (0.04)   0.20 (0.03)
0.1    0.31 (0.12)   0.16 (0.06)       0.11 (0.03)   0.20 (0.04)


Table 4: For each of the B = 100 repetitions, we derive the estimated proportions of classified
instances $\mathcal{P}_K$ and the estimated risks $\mathcal{R}_K$ of the two plug-in ε-confidence sets w.r.t. ε. We
compute the means and standard deviations (between parentheses) over the B = 100 repetitions.
Left: the data are generated according to Model 2, then Assumption (A1) holds; the procedure
used to build the plug-in ε-confidence set is based on the CART method, then Assumption (A2)
fails - Right: the data are generated according to Model 3, then Assumption (A1) fails; the
procedure used to build the plug-in ε-confidence set is based on kernels, then Assumption (A2)
holds.


4.2 Importance of Assumptions (A1) and (A2)

In this section, we shed some light on the importance of Assumptions (A1) and (A2). More
precisely, we study the behavior of plug-in ε-confidence sets when one of these two assumptions
is not satisfied.

We first consider a case where the cumulative distribution function $F_{f^*}$ is continuous but $F_{\hat f}$ is not. We
consider the simulation scheme described in Section 4.1 with Model 2 and parameters n = 100,
N = 100 and K = 1000. But this time, the plug-in ε-confidence set relies on the CART procedure,
which implies that Assumption (A2) does not hold. The results are reported in
Table 4-Left. Two observations can be made. First, judging by the estimated proportions of
classified instances and by the associated standard deviations, we are not able to control these
proportions. Therefore, one of the important features of our procedure fails. Second, although
the risk of misclassification decreases with ε, this decrease is quite slow and is not as
large as the one observed with the plug-in confidence sets studied in Section 4.1 (see Table 2). Indeed,
for the CART method, and more generally if Assumption (A2) does not hold, the proportion of
rejected data is usually not large enough.

Next, we study the reverse case where the cumulative distribution function $F_{\hat f}$ is continuous
but $F_{f^*}$ is not. We consider the following model.


• Model 3:

1. the feature X = U, where U follows a uniform distribution on [0,1];

2. conditional on X, the label Y is drawn according to a Bernoulli distribution with
parameter

\[
\eta^*(X) = \tfrac{1}{5}\mathbf{1}_{\{X \le 1/4\}} + \tfrac{2}{5}\mathbf{1}_{\{1/4 < X \le 1/2\}} + \tfrac{3}{5}\mathbf{1}_{\{1/2 < X \le 3/4\}} + \tfrac{4}{5}\mathbf{1}_{\{3/4 < X\}}.
\]


Then, for this model, $F_{f^*}$ is not continuous. Moreover, for ε ∈ [0.5, 1], $\mathcal{P}(\Gamma^*_\varepsilon) = 1$
and $\mathcal{R}(\Gamma^*_\varepsilon) = 3/10$, and for ε ∈ ]0, 0.5[, $\mathcal{P}(\Gamma^*_\varepsilon) = 1/2$ and $\mathcal{R}(\Gamma^*_\varepsilon) = 1/5$. For this model, we use
the simulation scheme described in Section 4.1 with the plug-in ε-confidence sets which rely
on the kernel rule and the sample sizes n = 1000, N = 100 and K = 1000. The results are
provided in Table 4-Right. We first note that, since Assumption (A2) is satisfied,
the proportions of classified instances match the theoretical values. Second, the estimated
risk of misclassification of the plug-in ε-confidence set decreases with ε. However, from our point
of view, it is irrelevant to compare the performance of the ε-confidence sets with that of the
plug-in ε-confidence sets. Indeed, except for ε = 1, the proportions of classified data differ. As
an example, if ε = 0.7, the estimated risk of the plug-in ε-confidence set is equal to 0.27, which
seems better than the risk of the ε-confidence set. But for ε = 0.7 the proportion of classified
instances is larger for the ε-confidence set and equals 1.
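The oracle values quoted for Model 3 can be checked exactly with a short computation. The sketch below is hedged: it assumes that η* takes the values 1/5, 2/5, 3/5 and 4/5 on the four quarters of [0,1] (an assignment consistent with the stated proportions and risks), and the names `ETA_VALUES`, `CELL_PROB` and `oracle_proportion_and_risk` are hypothetical.

```python
from fractions import Fraction

# Assumed values of eta* on the four quarters of [0, 1], each of probability 1/4.
ETA_VALUES = [Fraction(1, 5), Fraction(2, 5), Fraction(3, 5), Fraction(4, 5)]
CELL_PROB = Fraction(1, 4)


def oracle_proportion_and_risk(eps):
    """Exact proportion of classified points and conditional risk of the oracle set."""
    scores = [max(e, 1 - e) for e in ETA_VALUES]  # f* = max(eta*, 1 - eta*)

    def cdf(t):
        # Cumulative distribution function of f*(X); it has two jumps here.
        return sum(CELL_PROB for s in scores if s <= t)

    # A point is classified when F_{f*}(f*(X)) >= 1 - eps.
    kept = [s for s in scores if cdf(s) >= 1 - eps]
    proportion = CELL_PROB * len(kept)
    if proportion == 0:
        return proportion, None
    # Conditional on X, the Bayes rule errs with probability 1 - f*(X).
    risk = sum(CELL_PROB * (1 - s) for s in kept) / proportion
    return proportion, risk
```

With this assignment, `oracle_proportion_and_risk(Fraction(7, 10))` gives proportion 1 and risk 3/10, while `oracle_proportion_and_risk(Fraction(3, 10))` gives proportion 1/2 and risk 1/5. The proportion jumps from 1/2 to 1 at ε = 0.5 because $F_{f^*}$ is a step function, which is why the proportion of classified instances cannot equal ε in this model.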

5 Conclusion 

In the classification with reject option framework, we introduce a new procedure that allows
us to control exactly the rejection probability. The construction of the ε-confidence sets and
their plug-in approximations relies on the cumulative distribution functions of the score functions
$f^*$ and $\hat f$. Theoretical guarantees, especially rates of convergence, involve the continuity of
these cumulative distribution functions. Numerical experiments emphasize the importance of
the continuity assumption. The plug-in ε-confidence set is defined as a
two-step algorithm whose second step consists in the estimation of the cumulative distribution
function $F_{\hat f}$. Interestingly, this step does not require labeled data, which makes the procedure
suitable for semi-supervised learning. In future work, we intend to generalize our procedure to the
multiclass case and to study procedures based on empirical risk minimization.


6 Appendix 

This section gathers the proofs of our results. 

6.1 Proof of Proposition 

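We first define the following events

\[
\begin{aligned}
A_y &= \{f^*(X_\bullet) \ge \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) \ne 1,\ s^*(X_\bullet) \ne y\}, \quad y = 0,1,\\
B_y &= \{f^*(X_\bullet) < \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 1,\ s(X_\bullet) \ne y\}, \quad y = 0,1,\\
C &= \{f^*(X_\bullet) \ge \alpha_\varepsilon,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 1,\ s^*(X_\bullet) \ne s(X_\bullet)\},\\
D &= A_0 \cup A_1 \cup B_0 \cup B_1,
\end{aligned}
\]

and the random variable

\[
U = \mathbf{1}_{\{s(X_\bullet) \ne Y_\bullet,\ \mathrm{card}(\Gamma_s(X_\bullet)) = 1\}} - \mathbf{1}_{\{s^*(X_\bullet) \ne Y_\bullet,\ \mathrm{card}(\Gamma^*_\varepsilon(X_\bullet)) = 1\}}.
\]

Since $\mathcal{P}(\Gamma_s) = \mathcal{P}(\Gamma^*_\varepsilon) = \varepsilon$, the proof of the proposition relies on the decomposition of the
conditional expectation of U given $X_\bullet$ over the sets C and D. We have

\[
E[U\mathbf{1}_C \mid X_\bullet] = \mathbf{1}_C\big\{\eta^*(X_\bullet)\mathbf{1}_{\{s^*(X_\bullet)=1\}} - \eta^*(X_\bullet)\mathbf{1}_{\{s^*(X_\bullet)=0\}}
+ (\eta^*(X_\bullet) - 1)\mathbf{1}_{\{s^*(X_\bullet)=1\}} + (1 - \eta^*(X_\bullet))\mathbf{1}_{\{s^*(X_\bullet)=0\}}\big\}.
\]

Since $s^*(X_\bullet) = 1$ and $s^*(X_\bullet) = 0$ imply respectively that $\eta^*(X_\bullet) \ge 1/2$ and $\eta^*(X_\bullet) < 1/2$, we
obtain from the above decomposition

\[
E[U\mathbf{1}_C] = E\big[|2\eta^*(X_\bullet) - 1|\mathbf{1}_C\big]. \tag{17}
\]

Next,

\[
E[U\mathbf{1}_D \mid X_\bullet] = \eta^*(X_\bullet)\mathbf{1}_{B_1} + (1 - \eta^*(X_\bullet))\mathbf{1}_{B_0} - \eta^*(X_\bullet)\mathbf{1}_{A_1} - (1 - \eta^*(X_\bullet))\mathbf{1}_{A_0}. \tag{18}
\]

Since $\mathcal{P}(\Gamma_s) = \mathcal{P}(\Gamma^*_\varepsilon) = P(f^*(X_\bullet) \ge \alpha_\varepsilon)$, we deduce

\[
P(\mathrm{card}(\Gamma_s(X_\bullet)) = 1,\ f^*(X_\bullet) < \alpha_\varepsilon) = P(\mathrm{card}(\Gamma_s(X_\bullet)) \ne 1,\ f^*(X_\bullet) \ge \alpha_\varepsilon),
\]

which implies

\[
(1 - \alpha_\varepsilon)E[\mathbf{1}_{B_0 \cup B_1}] - (1 - \alpha_\varepsilon)E[\mathbf{1}_{A_0 \cup A_1}] = 0.
\]

Therefore, adding this null term to (18), we obtain

\[
\begin{aligned}
E[U\mathbf{1}_D] &= E\big[(\alpha_\varepsilon - (1 - \eta^*(X_\bullet)))\mathbf{1}_{B_1} + (\alpha_\varepsilon - \eta^*(X_\bullet))\mathbf{1}_{B_0}\big]\\
&\quad + E\big[(\eta^*(X_\bullet) - \alpha_\varepsilon)\mathbf{1}_{A_0} + ((1 - \eta^*(X_\bullet)) - \alpha_\varepsilon)\mathbf{1}_{A_1}\big]. \tag{19}
\end{aligned}
\]

Note that

\[
\begin{aligned}
f^*(X_\bullet) < \alpha_\varepsilon &\implies (\alpha_\varepsilon - (1 - \eta^*(X_\bullet))) > 0 \ \text{and}\ (\alpha_\varepsilon - \eta^*(X_\bullet)) > 0,\\
f^*(X_\bullet) \ge \alpha_\varepsilon \ \text{and}\ s^*(X_\bullet) \ne 1 &\implies (1 - \eta^*(X_\bullet) - \alpha_\varepsilon) \ge 0,\\
f^*(X_\bullet) \ge \alpha_\varepsilon \ \text{and}\ s^*(X_\bullet) \ne 0 &\implies (\eta^*(X_\bullet) - \alpha_\varepsilon) \ge 0.
\end{aligned}
\]

Hence, from (19), we can write

\[
E[U\mathbf{1}_D] = E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_0 \cup B_0}\big] + E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_1 \cup B_1}\big].
\]

Combining this result with (17) shows that $E[U\mathbf{1}_{C \cup D}] \ge 0$ and provides at the same time the
desired result.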


6.2 Proof of Proposition 

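We first prove the following inequality for $\alpha, \tilde\alpha \in [0, 1/2[$, $\alpha < \tilde\alpha$:

\[
P(s^*(X_\bullet) \ne Y_\bullet \mid f^*(X_\bullet) \ge \tilde\alpha) \le P(s^*(X_\bullet) \ne Y_\bullet \mid f^*(X_\bullet) \ge \alpha). \tag{20}
\]

Since for $\varepsilon, \tilde\varepsilon \in ]0,1]$ one has $\varepsilon < \tilde\varepsilon \implies \alpha_{\tilde\varepsilon} \le \alpha_\varepsilon$, a direct application of (20) yields the proposition.

In order to prove (20), recall that

\[
\frac{x}{y} - \frac{z}{t} = \frac{1}{2yt}\big((x - z)(t + y) + (x + z)(t - y)\big), \quad \forall x, z \in \mathbb{R},\ \forall y, t \in \mathbb{R}\setminus\{0\}. \tag{21}
\]

Thus, if we define

\[
\begin{aligned}
C_1 &= P(s^*(X_\bullet) \ne Y_\bullet,\ f^*(X_\bullet) \ge \alpha) - P(s^*(X_\bullet) \ne Y_\bullet,\ f^*(X_\bullet) \ge \tilde\alpha),\\
C_2 &= P(f^*(X_\bullet) \ge \alpha) + P(f^*(X_\bullet) \ge \tilde\alpha),\\
C_3 &= P(s^*(X_\bullet) \ne Y_\bullet,\ f^*(X_\bullet) \ge \alpha) + P(s^*(X_\bullet) \ne Y_\bullet,\ f^*(X_\bullet) \ge \tilde\alpha),\\
C_4 &= P(f^*(X_\bullet) \ge \tilde\alpha) - P(f^*(X_\bullet) \ge \alpha),
\end{aligned}
\]

from (21), we have

\[
P(s^*(X_\bullet) \ne Y_\bullet \mid f^*(X_\bullet) \ge \tilde\alpha) \le P(s^*(X_\bullet) \ne Y_\bullet \mid f^*(X_\bullet) \ge \alpha) \iff C_1C_2 + C_3C_4 \ge 0.
\]

Since $P(s^*(X_\bullet) \ne Y_\bullet \mid X_\bullet) = 1 - f^*(X_\bullet)$, we deduce that

\[
\begin{aligned}
C_1C_2 + C_3C_4 &= E\big[(1 - f^*(X_\bullet))\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big]\, E\big[\mathbf{1}_{\{f^*(X_\bullet) \ge \alpha\}} + \mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big]\\
&\quad - E\big[(1 - f^*(X_\bullet))\big(\mathbf{1}_{\{f^*(X_\bullet) \ge \alpha\}} + \mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big)\big]\, E\big[\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big].
\end{aligned}
\]

Note that $E[\mathbf{1}_{\{f^*(X_\bullet) \ge \alpha\}}] = E[\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}] + E[\mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}]$.

Hence, from the above decomposition, we obtain

\[
\begin{aligned}
C_1C_2 + C_3C_4 &= 2E\big[\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big]\, E\big[f^*(X_\bullet)\mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big]\\
&\quad - 2E\big[\mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big]\, E\big[f^*(X_\bullet)\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big].
\end{aligned}
\]

Since

\[
E\big[\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big]\, E\big[f^*(X_\bullet)\mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big] \ge \tilde\alpha\, P(\alpha \le f^*(X_\bullet) < \tilde\alpha)\, P(f^*(X_\bullet) \ge \tilde\alpha)
\]

and

\[
E\big[\mathbf{1}_{\{f^*(X_\bullet) \ge \tilde\alpha\}}\big]\, E\big[f^*(X_\bullet)\mathbf{1}_{\{\alpha \le f^*(X_\bullet) < \tilde\alpha\}}\big] \le \tilde\alpha\, P(\alpha \le f^*(X_\bullet) < \tilde\alpha)\, P(f^*(X_\bullet) \ge \tilde\alpha),
\]

we deduce Inequality (20).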


6.3 Proof of Proposition 

This section is devoted to the proof of the result related to the Gaussian mixture model. Before 
starting, let us state a few properties that will be often used. 

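Let us write for short $f_1$ and $f_0$ instead of $\eta^*$ and $1 - \eta^*$ respectively, so that $f^* = \max\{f_0, f_1\}$.
Hence, we can write:

\[
\eta^*(x) = f_1(x) = P(Y = 1 \mid X = x) = \frac{p_1(x)}{p_1(x) + p_0(x)},
\]

for any $x \in \mathcal{X}$. Then, for y = 0,1, using the fact that given $Y_\bullet = y$, the random variable
$X_\bullet^\top\Sigma^{-1}(\mu_{1-y} - \mu_y) \sim \mathcal{N}\big(\mu_y^\top\Sigma^{-1}(\mu_{1-y} - \mu_y),\ \|\mu_1 - \mu_0\|^2_{\Sigma^{-1}}\big)$, we get, on the event $\{Y_\bullet = y\}$,

\[
\begin{aligned}
s^*(X_\bullet) \ne Y_\bullet &\iff f^*(X_\bullet) = f_{1-y}(X_\bullet)\\
&\iff f_y(X_\bullet) \le f_{1-y}(X_\bullet) \iff \log\frac{p_{1-y}(X_\bullet)}{p_y(X_\bullet)} \ge 0\\
&\iff X_\bullet^\top\Sigma^{-1}(\mu_{1-y} - \mu_y) - \tfrac12\mu_{1-y}^\top\Sigma^{-1}\mu_{1-y} + \tfrac12\mu_y^\top\Sigma^{-1}\mu_y \ge 0\\
&\iff (X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y) - \tfrac12\|\mu_1 - \mu_0\|^2_{\Sigma^{-1}} \ge 0\\
&\iff \frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \tfrac12\|\mu_1 - \mu_0\|_{\Sigma^{-1}} \ge 0, \tag{22}
\end{aligned}
\]

where $\|\cdot\|_{\Sigma^{-1}}$ denotes the norm under $\Sigma^{-1}$: $\|\mu\|^2_{\Sigma^{-1}} = \mu^\top\Sigma^{-1}\mu$, for any $\mu \in \mathcal{X}$.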
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6.3.1 Intermediate results 

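The proof of the proposition relies on two intermediate results, which we state and prove first.
They bring into play the cumulative distribution function $F_{f^*}$.

Proposition 5. Let y ∈ {0,1}. Conditional on the event $\{Y_\bullet = y\}$ we have

\[
F_{f^*}(f_{1-y}(X_\bullet)) = \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\right)
+ \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}\right) - 1,
\]

where Φ is the standard normal cumulative distribution function.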

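Proof. To prove this result, we need to investigate the function $F_{f^*}(\cdot) = P(f^*(X) \le \cdot)$. Let
a ∈ [1/2,1]. We have

\[
\begin{aligned}
P(f^*(X) \le a) &= P(f_1(X) \le a,\ f_1(X) \ge f_0(X)) + P(f_0(X) \le a,\ f_1(X) < f_0(X))\\
&= \tfrac12 P(f_1(X) \le a,\ f_1(X) \ge f_0(X) \mid Y = 1) + \tfrac12 P(f_1(X) \le a,\ f_1(X) \ge f_0(X) \mid Y = 0)\\
&\quad + \tfrac12 P(f_0(X) \le a,\ f_1(X) < f_0(X) \mid Y = 1) + \tfrac12 P(f_0(X) \le a,\ f_1(X) < f_0(X) \mid Y = 0), \tag{23}
\end{aligned}
\]

where we used in the last equality the fact that Y is a Bernoulli random variable with parameter
1/2. As already seen, we have for y ∈ {0,1},

\[
f_y(x) = P(Y = y \mid X = x) = \frac{p_y(x)}{p_1(x) + p_0(x)}.
\]

Hence, denoting by u the function from [1/2,1) into [1,+∞) defined by $u(a) = \frac{a}{1-a}$ and plugging
this notation into the above relation (23), we get

\[
\begin{aligned}
P(f^*(X) \le a) &= \tfrac12\Big[P\Big(\tfrac{p_1(X)}{p_0(X)} \in [1,u(a)] \,\Big|\, Y = 1\Big) + P\Big(\tfrac{p_1(X)}{p_0(X)} \in [1,u(a)] \,\Big|\, Y = 0\Big)\\
&\qquad + P\Big(\tfrac{p_0(X)}{p_1(X)} \in\, ]1,u(a)] \,\Big|\, Y = 1\Big) + P\Big(\tfrac{p_0(X)}{p_1(X)} \in\, ]1,u(a)] \,\Big|\, Y = 0\Big)\Big]\\
&=: \tfrac12(A_1 + A_2 + A_3 + A_4). \tag{24}
\end{aligned}
\]

All of the terms $A_1, A_2, A_3, A_4$ are treated in the same way. Let us consider $A_1$ for
instance: using the same reasoning as in (22) with y = 1, we have

\[
\begin{aligned}
A_1 &= P\Big(0 \le \log\tfrac{p_1(X)}{p_0(X)} \le \log(u(a)) \,\Big|\, Y = 1\Big)\\
&= P\Big(0 \le -(X - \mu_1)^\top\Sigma^{-1}(\mu_0 - \mu_1) + \tfrac12\|\mu_1 - \mu_0\|^2_{\Sigma^{-1}} \le \log(u(a)) \,\Big|\, Y = 1\Big)\\
&= P\Big(0 \le Z + \tfrac12\|\mu_1 - \mu_0\|_{\Sigma^{-1}} \le \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\Big)\\
&= P\Big(-\tfrac12\|\mu_1 - \mu_0\|_{\Sigma^{-1}} \le Z \le \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \tfrac12\|\mu_1 - \mu_0\|_{\Sigma^{-1}}\Big),
\end{aligned}
\]

where Z is a standard normal random variable. In the same way, we get

\[
\begin{aligned}
A_1 = A_4 &= P\Big(-\tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2} \le Z \le \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big),\\
A_2 = A_3 &= P\Big(\tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2} \le Z \le \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} + \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big).
\end{aligned}
\]

Coming back to (24) and using twice the relation Φ(x) + Φ(−x) = 1 for any x ∈ ℝ,
which is valid for the normal distribution, we easily get

\[
\begin{aligned}
F_{f^*}(a) = P(f^*(X) \le a) &= A_1 + A_2\\
&= \Phi\Big(\tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big) + \Phi\Big(\tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} + \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big) - 1\\
&= \Phi\Big(\tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2} + \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\Big) - \Phi\Big(\tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2} - \tfrac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\Big). \tag{25}
\end{aligned}
\]

At this point, we are ready to evaluate the quantity $F_{f^*}(f_{1-y}(X_\bullet))$ on the event $\{Y_\bullet = y\}$ with
y ∈ {0,1}. Indeed, according to (25), we only need to evaluate $\frac{\log(u(a))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}$ for $a = f_{1-y}(X_\bullet)$.

Thanks to (22) we can write, when $Y_\bullet = y$,

\[
\frac{\log(u(f_{1-y}(X_\bullet)))}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} = \frac{\log\big(p_{1-y}(X_\bullet)/p_y(X_\bullet)\big)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}
= \frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \frac12\|\mu_1 - \mu_0\|_{\Sigma^{-1}}.
\]

Finally, using (25), we get, when $Y_\bullet = y$,

\[
\begin{aligned}
F_{f^*}(f_{1-y}(X_\bullet)) &= \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\right) - \Phi\left(\|\mu_1 - \mu_0\|_{\Sigma^{-1}} - \frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\right)\\
&= \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\right) + \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}\right) - 1,
\end{aligned}
\]

where we have also used the relation Φ(x) + Φ(−x) = 1, for x ∈ ℝ, in the last line, since Φ is the
normal cumulative distribution function. This ends the proof. □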
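As a sanity check of the closed form (25), the expression $F_{f^*}(a) = \Phi(\delta/2 + L) - \Phi(\delta/2 - L)$, with $\delta = \|\mu_1 - \mu_0\|_{\Sigma^{-1}}$ and $L = \log(u(a))/\delta$, can be compared with a Monte Carlo estimate. The sketch below is only illustrative: it assumes Σ is the identity matrix and equal class priors, and the function names and the chosen means are hypothetical.

```python
import math

import numpy as np


def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def score_cdf_formula(a, delta):
    """Closed form of F_{f*}(a) for a in [1/2, 1), with delta = ||mu_1 - mu_0||."""
    log_term = math.log(a / (1.0 - a)) / delta
    return std_normal_cdf(delta / 2.0 + log_term) - std_normal_cdf(delta / 2.0 - log_term)


def score_cdf_monte_carlo(a, mu0, mu1, n_samples, seed):
    """Monte Carlo estimate of P(f*(X) <= a), assuming Sigma = identity."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)                # P(Y = 1) = 1/2
    means = np.where(y[:, None] == 1, mu1, mu0)
    x = means + rng.standard_normal((n_samples, mu0.size))
    # log(p_1(x) / p_0(x)) for Gaussian densities with identity covariance.
    log_ratio = x @ (mu1 - mu0) - 0.5 * (mu1 @ mu1 - mu0 @ mu0)
    eta = 1.0 / (1.0 + np.exp(-log_ratio))                # eta*(x) with equal priors
    score = np.maximum(eta, 1.0 - eta)                    # f*(x)
    return float(np.mean(score <= a))
```

For example, with `mu0 = np.zeros(2)` and `mu1 = np.ones(2)` one has δ = √2, and the two functions should agree at a = 0.7 up to Monte Carlo error.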


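The next result is the key tool in the proof of the proposition.

Proposition 6. Let ε ∈ ]0,1]. For y ∈ {0,1}, we have

\[
P\big(s^*(X_\bullet) \ne Y_\bullet,\ F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon \mid Y_\bullet = y\big)
= P\Big(\Phi(Z) + \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \ge 2 - \varepsilon,\ Z \ge \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big),
\]

where $Z \sim \mathcal{N}(0,1)$.

Proof. Let ε ∈ ]0,1]. For y ∈ {0,1}, according to the first equivalence stated in (22), we observe
that

\[
P\big(F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon,\ s^*(X_\bullet) \ne Y_\bullet \mid Y_\bullet = y\big)
= P\big(F_{f^*}(f_{1-y}(X_\bullet)) \ge 1 - \varepsilon,\ f^*(X_\bullet) = f_{1-y}(X_\bullet) \mid Y_\bullet = y\big). \tag{26}
\]

Moreover, using the last equivalence in (22), we have, if $Y_\bullet = y$,

\[
f^*(X_\bullet) = f_{1-y}(X_\bullet) \iff \frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} \ge \frac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}. \tag{27}
\]

Then we just need to rewrite the event $\{F_{f^*}(f_{1-y}(X_\bullet)) \ge 1 - \varepsilon\}$, when conditioned on the event
$\{Y_\bullet = y\}$, in a convenient way. Using Proposition 5, we can write that, when $Y_\bullet = y$,

\[
F_{f^*}(f_{1-y}(X_\bullet)) = \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}\right)
+ \Phi\left(\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}} - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}\right) - 1. \tag{28}
\]

Plugging (27) and (28) into (26), we finally get

\[
P\big(F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon,\ s^*(X_\bullet) \ne Y_\bullet \mid Y_\bullet = y\big)
= P\Big(\Phi(Z_\bullet) + \Phi(Z_\bullet - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \ge 2 - \varepsilon,\ Z_\bullet \ge \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big),
\]

where $Z_\bullet \sim \mathcal{N}(0,1)$. The last equality is due to the fact that, given $Y_\bullet = y$, the random variable
$\frac{(X_\bullet - \mu_y)^\top\Sigma^{-1}(\mu_{1-y} - \mu_y)}{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}$ is normally distributed. We then get the desired result and the proof of
the proposition is completed. □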


6.3.2 Proposition]^ 

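Let ε ∈ ]0,1]. Since $P(Y_\bullet = 1) = P(Y_\bullet = 0) = 1/2$, we have

\[
\begin{aligned}
P\big(s^*(X_\bullet) \ne Y_\bullet,\ F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon\big) = \tfrac12\big\{&P\big(s^*(X_\bullet) \ne Y_\bullet,\ F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon \mid Y_\bullet = 1\big)\\
&+ P\big(s^*(X_\bullet) \ne Y_\bullet,\ F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon \mid Y_\bullet = 0\big)\big\}.
\end{aligned}
\]

Next, using Proposition 6, we get

\[
\begin{aligned}
P\big(s^*(X_\bullet) \ne Y_\bullet,\ F_{f^*}(f^*(X_\bullet)) \ge 1 - \varepsilon\big)
&= P\Big(\Phi(Z) + \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \ge 2 - \varepsilon,\ Z \ge \tfrac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big)\\
&= P\big(\Phi(Z) + \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \ge 2 - \varepsilon\big).
\end{aligned}
\]

The last equality is due to the following property:

\[
Z < \frac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2} \implies \Phi(Z) < \Phi\Big(\frac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big) \ \text{and}\ \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) < \Phi\Big(-\frac{\|\mu_1 - \mu_0\|_{\Sigma^{-1}}}{2}\Big),
\]

which implies that

\[
\Phi(Z) + \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) < 1 \le 2 - \varepsilon.
\]

The end of the proof is straightforward and follows from the relation Φ(x) + Φ(−x) = 1, ∀x ∈ ℝ.
Indeed, we have

\[
\begin{aligned}
P\big(\Phi(Z) + \Phi(Z - \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \ge 2 - \varepsilon\big) &= P\big(\Phi(-Z) + \Phi(-Z + \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \le \varepsilon\big)\\
&= P\big(\Phi(Z) + \Phi(Z + \|\mu_1 - \mu_0\|_{\Sigma^{-1}}) \le \varepsilon\big),
\end{aligned}
\]

since Z and −Z are equal in law. This ends the proof.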
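The last identity makes the risk of the ε-confidence set in the Gaussian mixture model easy to evaluate numerically. Since $z \mapsto \Phi(z) + \Phi(z + \delta)$ is increasing, with $\delta = \|\mu_1 - \mu_0\|_{\Sigma^{-1}}$, the probability above equals $\Phi(z_\varepsilon)$ where $z_\varepsilon$ solves $\Phi(z) + \Phi(z + \delta) = \varepsilon$; dividing by $\mathcal{P}(\Gamma^*_\varepsilon) = \varepsilon$ gives the conditional risk. The sketch below is a hedged illustration of this computation by bisection; the function names and the bracketing interval are assumptions of the example.

```python
import math


def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def oracle_risk(eps, delta, n_iter=200):
    """Risk P(Phi(Z) + Phi(Z + delta) <= eps) / eps for eps in (0, 1]."""
    lo, hi = -40.0, 40.0  # assumed bracket: the sum is ~0 at lo and ~2 at hi
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        # The map z -> Phi(z) + Phi(z + delta) is increasing, so bisection applies.
        if std_normal_cdf(mid) + std_normal_cdf(mid + delta) < eps:
            lo = mid
        else:
            hi = mid
    z_eps = 0.5 * (lo + hi)
    # {Phi(Z) + Phi(Z + delta) <= eps} = {Z <= z_eps}, of probability Phi(z_eps).
    return std_normal_cdf(z_eps) / eps
```

For ε = 1 the root is $z_1 = -\delta/2$, so the function returns $\Phi(-\delta/2)$, the usual misclassification risk of the Bayes classifier for two equally likely Gaussian classes with common covariance. Smaller values of ε give smaller risks.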


6.4 Proof of Proposition 

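We first define the following events

\[
\begin{aligned}
A_y &= \{f^*(X_\bullet) \ge \alpha_\varepsilon,\ \hat f(X_\bullet) < \tilde\alpha_\varepsilon,\ s^*(X_\bullet) \ne y\}, \quad y = 0,1,\\
B_y &= \{f^*(X_\bullet) < \alpha_\varepsilon,\ \hat f(X_\bullet) \ge \tilde\alpha_\varepsilon,\ \hat s(X_\bullet) \ne y\}, \quad y = 0,1,\\
C_y &= \{f^*(X_\bullet) \ge \alpha_\varepsilon,\ \hat f(X_\bullet) \ge \tilde\alpha_\varepsilon,\ s^*(X_\bullet) \ne \hat s(X_\bullet),\ s^*(X_\bullet) \ne y\}, \quad y = 0,1.
\end{aligned}
\]

Since $\mathcal{P}(\tilde\Gamma_\varepsilon) = \varepsilon$, we can apply Proposition 2 and then, as

\[
|2\eta^*(X_\bullet) - 1| \le |\eta^*(X_\bullet) - \alpha_\varepsilon| + |1 - \eta^*(X_\bullet) - \alpha_\varepsilon|,
\]

we deduce that

\[
\mathcal{R}(\tilde\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) \le \frac{1}{\varepsilon}\Big\{E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_0 \cup B_0 \cup C_0 \cup C_1}\big] + E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_1 \cup B_1 \cup C_0 \cup C_1}\big]\Big\}. \tag{29}
\]

Now,

1. on $A_0$, $f^* = \eta^*$, $\eta^*(X_\bullet) \ge \alpha_\varepsilon$ and $\hat f(X_\bullet) < \tilde\alpha_\varepsilon$,
hence, we have $|\eta^*(X_\bullet) - \alpha_\varepsilon| \le |\hat\eta(X_\bullet) - \eta^*(X_\bullet)|$ except if $\alpha_\varepsilon < \tilde\alpha_\varepsilon$ and $\hat f(X_\bullet) \in (\alpha_\varepsilon, \tilde\alpha_\varepsilon)$;

2. on $B_0$, $\hat f = \hat\eta$, $\hat\eta(X_\bullet) \ge \tilde\alpha_\varepsilon$ and $f^*(X_\bullet) < \alpha_\varepsilon$,
hence, we have $|\eta^*(X_\bullet) - \alpha_\varepsilon| \le |\hat\eta(X_\bullet) - \eta^*(X_\bullet)|$ except if $\tilde\alpha_\varepsilon < \alpha_\varepsilon$ and $\hat f(X_\bullet) \in (\tilde\alpha_\varepsilon, \alpha_\varepsilon)$;

3. on $C_0$, $f^* = \eta^*$, $\hat f = 1 - \hat\eta$, $\eta^*(X_\bullet) \ge \alpha_\varepsilon$ and $\hat\eta(X_\bullet) \le 1/2$,
hence, we always have $|\eta^*(X_\bullet) - \alpha_\varepsilon| \le |\hat\eta(X_\bullet) - \eta^*(X_\bullet)|$;

4. on $C_1$, $f^* = 1 - \eta^*$, $\hat f = \hat\eta$ and $\hat\eta(X_\bullet) \ge \tilde\alpha_\varepsilon$,
hence, we have $|\eta^*(X_\bullet) - \alpha_\varepsilon| \le |\hat\eta(X_\bullet) - \eta^*(X_\bullet)|$ except if $\tilde\alpha_\varepsilon < \alpha_\varepsilon$ and $\hat f(X_\bullet) \in (\tilde\alpha_\varepsilon, \alpha_\varepsilon)$.

Since $A_0$, $B_0$, $C_0$ and $C_1$ are mutually exclusive events, we deduce

\[
\begin{aligned}
E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_0 \cup B_0 \cup C_0 \cup C_1}\big] &\le E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \ge |\eta^*(X_\bullet) - \alpha_\varepsilon|\}}\big]\\
&\quad + E\Big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\big(\mathbf{1}_{\{A_0,\ \alpha_\varepsilon < \tilde\alpha_\varepsilon,\ \hat f(X_\bullet) \in (\alpha_\varepsilon, \tilde\alpha_\varepsilon)\}} + \mathbf{1}_{\{B_0 \cup C_1,\ \tilde\alpha_\varepsilon < \alpha_\varepsilon,\ \hat f(X_\bullet) \in (\tilde\alpha_\varepsilon, \alpha_\varepsilon)\}}\big)\Big]. \tag{30}
\end{aligned}
\]

In the same way, we obtain the following decomposition

\[
\begin{aligned}
E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{A_1 \cup B_1 \cup C_0 \cup C_1}\big] &\le E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \ge |1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\}}\big]\\
&\quad + E\Big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\big(\mathbf{1}_{\{A_1,\ \alpha_\varepsilon < \tilde\alpha_\varepsilon,\ \hat f(X_\bullet) \in (\alpha_\varepsilon, \tilde\alpha_\varepsilon)\}} + \mathbf{1}_{\{B_1 \cup C_0,\ \tilde\alpha_\varepsilon < \alpha_\varepsilon,\ \hat f(X_\bullet) \in (\tilde\alpha_\varepsilon, \alpha_\varepsilon)\}}\big)\Big]. \tag{31}
\end{aligned}
\]

Since $(A_y, B_y, C_y)$, y = 0,1, are mutually exclusive events, and since $|\eta^*(X_\bullet) - \alpha_\varepsilon| \le \alpha_\varepsilon$ and
$|1 - \eta^*(X_\bullet) - \alpha_\varepsilon| \le \alpha_\varepsilon$, it follows from Inequalities (29), (30) and (31) that

\[
\begin{aligned}
\mathcal{R}(\tilde\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon) \le \frac{1}{\varepsilon}\Big\{&E\big[|\eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \ge |\eta^*(X_\bullet) - \alpha_\varepsilon|\}}\big]\\
&+ E\big[|1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\mathbf{1}_{\{|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \ge |1 - \eta^*(X_\bullet) - \alpha_\varepsilon|\}}\big]\\
&+ \alpha_\varepsilon\,|F_{\hat f}(\alpha_\varepsilon) - F_{f^*}(\alpha_\varepsilon)|\Big\}.
\end{aligned}
\]

To conclude the proof, it remains to note that $1 - \varepsilon = F_{f^*}(\alpha_\varepsilon) = F_{\hat f}(\tilde\alpha_\varepsilon)$, for all ε ∈ ]0,1].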


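6.5 Proof of Theorem 1

We first state a lemma that will be used in the proof.


6.5.1 Tool lemma

The following lemma is inspired by Lemma 3.1 in [AT07].

Lemma 1. Let X be a real random variable, $(X_n)_{n \ge 1}$ be a sequence of real random variables
and $t_0 \in \mathbb{R}$. Assume that there exist $C_1 < \infty$ and $\gamma_0 > 0$ such that

\[
P_X(|X - t_0| \le \delta) \le C_1\delta^{\gamma_0}, \quad \forall \delta > 0,
\]

and a sequence of positive numbers $a_n \to +\infty$ and positive constants $C_2, C_3$ such that

\[
P_{X_n}(|X_n - X| \ge \delta \mid X) \le C_2\exp(-C_3 a_n\delta^2), \quad \forall \delta > 0,\ \forall n \in \mathbb{N}.
\]

Then, there exists C > 0 depending only on $C_1$, $C_2$ and $C_3$, such that

\[
\big|E\big[\mathbf{1}_{\{X_n \ge t_0\}} - \mathbf{1}_{\{X \ge t_0\}}\big]\big| \le E\big[\big|\mathbf{1}_{\{X_n \ge t_0\}} - \mathbf{1}_{\{X \ge t_0\}}\big|\big]
\le P(|X_n - X| \ge |X - t_0|) \le C a_n^{-\gamma_0/2}.
\]

Proof. The following inequality holds

\[
\big|\mathbf{1}_{\{X_n \ge t_0\}} - \mathbf{1}_{\{X \ge t_0\}}\big| \le \mathbf{1}_{\{|X_n - X| \ge |X - t_0|\}}.
\]

Hence, it remains to prove

\[
P(|X_n - X| \ge |X - t_0|) \le C a_n^{-\gamma_0/2}.
\]

We define, for δ > 0,

\[
\begin{aligned}
A_0 &= \{|X - t_0| \le \delta\},\\
A_j &= \{2^{j-1}\delta < |X - t_0| \le 2^j\delta\}, \quad j \ge 1.
\end{aligned}
\]

Since the events $(A_j)_{j \ge 0}$ are mutually exclusive, we deduce

\[
\begin{aligned}
P(|X_n - X| \ge |X - t_0|) &= \sum_{j \ge 0} E\big[\mathbf{1}_{\{|X_n - X| \ge |X - t_0|\}}\mathbf{1}_{A_j}\big]\\
&\le P_X(|X - t_0| \le \delta) + \sum_{j \ge 1} E\big[\mathbf{1}_{\{|X_n - X| \ge 2^{j-1}\delta\}}\mathbf{1}_{A_j}\big]\\
&\le C_1\delta^{\gamma_0} + \sum_{j \ge 1} E_X\big[P_{X_n}(|X_n - X| \ge 2^{j-1}\delta \mid X)\mathbf{1}_{A_j}\big]\\
&\le C_1\delta^{\gamma_0} + C_1C_2\delta^{\gamma_0}\sum_{j \ge 1} 2^{j\gamma_0}\exp(-C_3 a_n 2^{2j-2}\delta^2),
\end{aligned}
\]

since $P_X(A_j) \le P_X(|X - t_0| \le 2^j\delta) \le C_1(2^j\delta)^{\gamma_0}$. Therefore, choosing $\delta = a_n^{-1/2}$, we obtain from
the above inequality

\[
\begin{aligned}
P(|X_n - X| \ge |X - t_0|) &\le C_1 a_n^{-\gamma_0/2} + C_1C_2 a_n^{-\gamma_0/2}\sum_{j \ge 1} 2^{j\gamma_0}\exp(-C_3 2^{2j-2})\\
&\le C a_n^{-\gamma_0/2}
\end{aligned}
\]

for a constant C > 0. □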


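6.5.2 Theorem 1

Let ε ∈ ]0,1[. We first prove that for N large enough

\[
\Big|P\big(\hat F_{\hat f}(\hat f(X_\bullet)) \ge 1 - \varepsilon\big) - P\big(F_{\hat f}(\hat f(X_\bullet)) \ge 1 - \varepsilon\big)\Big| \le \frac{C}{\sqrt{N}}, \tag{32}
\]

and

\[
\big|\mathcal{R}(\hat\Gamma_\varepsilon) - \mathcal{R}(\tilde\Gamma_\varepsilon)\big| \le \frac{C'}{\sqrt{N}}, \tag{33}
\]

where C, C' > 0 are constants which do not depend on n. For all x ∈ [1/2,1],

\[
F_{\hat f}(x) = P_X\big(\hat f(X) \le x \mid \mathcal{D}_n\big).
\]

Hence, conditional on $\mathcal{D}_n$, $\hat F_{\hat f}(x)$ is the empirical cumulative distribution function of $\hat f(X)$,
where $\hat f$ is viewed as a deterministic function. Therefore, for all $\gamma \ge \sqrt{\log(2)/(2N)}$, the
Dvoretzky-Kiefer-Wolfowitz inequality yields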

Pi^« {\FfiKx,)) - Ff{f{X,))\ > A.) 


< 

< 


Px,„ sup 

\xG[1/2.1] 

2exp(-2A72), 


7(x) 


Unix) 
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where Un{x) = P 
Applying Lemma [L we get 


E 


{'Dn,X.) 




< 


c 

71’ 


where C does not depend on n. Hence, we obtain Inequality (32l. In the same way, we have 

C 


(F;(/(A.)) > 1 - s(A.) ^ n) - p {PfifiX,)) > I - e, s(A.) ^ F.) 


< 


Vn' 


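Therefore, Inequality (33) holds for some constant C' > 0.

Since, by Assumption (A2), $P\big(F_{\hat f}(\hat f(X_\bullet)) \ge 1 - \varepsilon\big) = \varepsilon$, Inequality (32) yields

\[
P\big(\hat F_{\hat f}(\hat f(X_\bullet)) \ge 1 - \varepsilon\big) = \varepsilon + O(N^{-1/2}).
\]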
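The $N^{-1/2}$ rate comes from the Dvoretzky-Kiefer-Wolfowitz inequality, whose bound $2\exp(-2N\gamma^2)$ can be checked by simulation. The sketch below is an illustration under an assumption made for the example: the scores are drawn uniformly on [0,1], which is the distribution of $F_{\hat f}(\hat f(X))$ when this distribution function is continuous. The function names are hypothetical.

```python
import math

import numpy as np


def sup_deviation_uniform(sample):
    """sup_x |F_N(x) - x| for a sample drawn from the uniform law on [0, 1]."""
    u = np.sort(sample)
    n = u.size
    upper = np.arange(1, n + 1) / n - u  # deviation just after each jump
    lower = u - np.arange(0, n) / n      # deviation just before each jump
    return max(upper.max(), lower.max())


def dkw_bound(n, gamma):
    """Right-hand side of the Dvoretzky-Kiefer-Wolfowitz inequality."""
    return 2.0 * math.exp(-2.0 * n * gamma ** 2)


def exceedance_frequency(n, gamma, n_rep, seed):
    """Fraction of repetitions in which the sup deviation is at least gamma."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        if sup_deviation_uniform(rng.random(n)) >= gamma:
            hits += 1
    return hits / n_rep
```

With N = 100 and γ = 0.2, `dkw_bound(100, 0.2)` equals $2e^{-8}$, and the simulated exceedance frequency should be of the same order or smaller, up to Monte Carlo error.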


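Now, we prove point 1) of the theorem. Since Inequality (33) ensures that

\[
\big|\mathcal{R}(\hat\Gamma_\varepsilon) - \mathcal{R}(\tilde\Gamma_\varepsilon)\big| \to 0, \quad n, N \to +\infty,
\]

it remains to prove that

\[
\big|\mathcal{R}(\tilde\Gamma_\varepsilon) - \mathcal{R}(\Gamma^*_\varepsilon)\big| \to 0, \quad n \to +\infty.
\]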


Applying Proposition 3, we obtain for Sn > 0, —> 0 


(r*) - (r:)| < 2<5„ + 2P mx,) - 7J*{X,)\ > 5„) + \F^ia,) - F}{a,) 


R 


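Since $\hat\eta(X_\bullet) \to \eta^*(X_\bullet)$ in probability when n → +∞, $\hat f(X_\bullet) \to f^*(X_\bullet)$ in distribution and
$F_{\hat f}(\alpha_\varepsilon) - F_{f^*}(\alpha_\varepsilon) \to 0$. Moreover, $P(|\hat\eta(X_\bullet) - \eta^*(X_\bullet)| \ge \delta_n) \to 0$, which concludes the proof
of point 1).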

Finally, to prove 2), it remains to show that 


R 


(r:) -7^(^:)| = o(a;;^^/2). 


We first note that, 
Ff{a,) - F}{a,) 


< E 

< E 




+E |l{/(X.)>a,} - l{/*(^.)>«A|l{l»7(^.)-F(X.)l<h*(^.)-l/2|} 
< PmX,)-V*iX,)\>\r,*{X,)-a,\) 

+P mx,) - rj*{X,)\ > \7j*{X,) - (1 - a,)|) 

+E [|l{,j(X.)>aa - l{r)*(^.)>ae}l] ' 


Therefore, applying both Proposition 3, Lemma and the above inequality, we get the desired 
result. 
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